Real-time systems are becoming common, and multiprocessors are also being used
more often. There is an increasing need for building priority synchronization
primitives for real-time systems.

In this regard, we have designed a prioritized spin-lock mutual exclusion
algorithm, the PR-Lock. The PR-Lock algorithm is characterized by prioritized
lock acquisition, a low release overhead, very little bus contention, and
well-defined semantics. While other prioritized spin-locks have been proposed,
the PR-Lock has superior characteristics.

Current methods for synchronization (including the PR-Lock) in real-time systems
are  pessimistic,  and  use  blocking  to  enforce  concurrency  control.  While  protocols  to 
bound  the  blocking  of  high  priority  tasks  exist,  high  priority  tasks  can  still  be  blocked 
by  low  priority  tasks.  In  addition,  these  protocols  require  a  complex  interaction  with 
the  scheduler. 

We present a new approach to synchronization with special applicability to
embedded and real-time systems. We propose Interruptible Critical Sections, an
optimistic synchronization mechanism, as an alternative to purely blocking
methods. Practical optimistic synchronization requires techniques for writing
interruptible critical sections, and system support for detecting critical
section conflicts.

We  show  how  Interruptible  Critical  Sections  can  be  used  to  design  algorithms  for 
synchronization in a real-time system. These algorithms vary depending on the
environment considered and the techniques used. Our experimental performance results
show  that  these  algorithms  reduce  the  variance  in  the  response  time  of  the  highest 
priority  task  with  only  a  small  impact  on  the  performance  of  the  low  priority  tasks. 
We  also  present  an  analysis  which  shows  that  Interruptible  Critical  Sections  improve 
the  schedulability  of  task  sets  that  have  high  priority  tasks  with  tight  deadlines. 

We extend the usage of Interruptible Critical Sections to multiprocessor systems
under real-time and non-real-time environments. Our performance evaluation shows
that the algorithms perform well, making Interruptible Critical Sections a feasible
mechanism for synchronization.

CHAPTER 1
INTRODUCTION 


1.1  Goals 


The  broad  objective  of  this  research  was  to  devise  algorithms  and  data  structures 
for  synchronization  on  real-time  systems.  Most  of  the  work  done  in  synchronization  is 
not  based  on  priorities,  and  thus  is  not  suitable  for  real-time  systems.  In  this  regard, 
we  have  designed  a  prioritized  spin-lock  algorithm,  the  PR-Lock.  We  also  present 
a  novel  optimistic  synchronization  mechanism  named  Interruptible  Critical  Sections. 
We  show  how  this  mechanism  can  be  used  to  design  algorithms  for  synchronization 
in  a  real-time  system. 

Real-time systems are becoming common, and multiprocessors are also being used
more often. There is an increasing need for building priority synchronization
primitives for real-time systems [91]. Even on a uniprocessor, mutual exclusion is
necessary to protect shared data in an interleaved thread schedule.

This research bridges two areas: real-time systems, and synchronization on
uniprocessors as well as multiprocessors. Synchronization in multiprocessors
includes concurrency control techniques which guarantee that a process executes
an entire code section without being interrupted by another process.

In  real-time  systems,  each  process  has  strict  timing  constraints  and  is  associated 
with  a  priority  indicating  the  urgency  of  that  process  [88].  This  priority  is  used  by 
the  operating  system  to  order  the  rendering  of  services  among  competing  processes. 
Normally, the higher the priority of a process, the faster its request for services gets
honored.

Synchronization controls access to a shared resource, usually some data structure
shared between processes. Synchronization on a shared-memory multiprocessor is an
important operation, since an application process's speedup or throughput
depends on the operation's efficiency.

When  synchronization  primitives  disregard  priorities,  a  lower  priority  process  can 
be  granted  access  to  a  critical  section  among  a  set  of  competing  processes.  Thus, 
lower  priority  processes  may  block  the  execution  of  a  process  with  a  higher  priority 
and  a  stricter  timing  constraint  [80].  This  priority  inversion  may  cause  the  higher 
priority  process  to  miss  its  deadline,  leading  to  the  failure  of  a  real-time  system. 
Priority  Inheritance  and  Priority  Ceiling  protocols  belong  to  a  class  of  protocols 
that  reduces  the  effect  of  priority  inversion.  These  protocols  bound  the  amount  of 
time  a  higher  priority  process  is  blocked  by  a  set  of  lower  priority  processes,  making 
the  execution  time  of  a  process  more  predictable.  However,  the  protocols  require 
prioritized  semaphores. 

Priority  synchronization  algorithms  ensure  that  higher  priority  processes  gain 
access  to  the  critical  section  before  any  of  the  competing  lower  priority  processes. 
For  processes  using  priority  synchronization  algorithms,  more  accurate,  optimistic 
and  predictable  execution  time  estimates  can  be  made.  This,  in  turn,  improves  the 
schedulability  of  a  set  of  processes  in  a  real-time  system. 

There  are  several  software  synchronization  algorithms  in  existence,  usually  based 
on  some  hardware  provided  mechanisms.  Synchronization  algorithms  can  be  classified 
broadly  on  the  approach  they  take:  pessimistic  or  optimistic. 

1.2 Pessimistic Synchronization

In  pessimistic  synchronization  algorithms,  the  use  of  a  shared  resource  is  guarded 
by  a  lock  or  regulated  by  a  queue.  This  approach  views  processes  as  being  in  existence 
for  the  sole  purpose  of  interfering  with  each  other.  Thus,  by  means  of  a  lock,  care 
is  taken  so  that  no  other  process  is  using  the  resource  before  allowing  a  requesting 
process  to  use  it. 

Most  of  the  current  synchronization  mechanisms  are  based  on  spin-locks  [23,  30, 
47,  52,  67,  81,  82],  or  queue-based  locks  [4,  28,  30,  67].  Current  multiprocessor 
hardware  design  includes  read-modify-write  atomic  instructions  that  assist  this  type 
of  synchronization. 

As  mentioned  earlier,  using  these  types  of  locks  for  real-time  systems  presents  the 
problem  of  priority  inversion  because  priority  is  not  taken  into  consideration.  There 
is  a  need  to  modify  or  redesign  these  algorithms  for  use  in  real-time  systems.  Not 
much  work  has  been  done  in  this  direction.  We  discuss  what  has  been  done  so  far 
in  chapter  2.  Having  designed  synchronization  protocols  for  real-time  systems,  there 
remain  the  issues  of  proving  their  correctness,  analyzing  their  performance,  weighing 
their  costs,  and  implementing  the  algorithms. 

1.3  Optimistic  Synchronization 

A  process  using  optimistic  synchronization  algorithms  uses  the  shared  resource 
with  the  assumption  that  no  other  process  may  be  interfering  with  it.  In  the  event 
that  such  a  conflict  occurs,  it  is  detected,  and  the  affected  process  starts  a  recovery 
mechanism.  Recovery  includes  either  re-executing  the  critical  section  all  over  again 
or  preposting  the  computation  for  any  other  process  to  complete. 

Optimistic synchronization algorithms are suitable for use in real-time systems
as they do not cause priority inversion, and there is no need to design any priority
inheritance protocols.
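
The detect-and-retry style of recovery described above can be illustrated with a small Python sketch. It is not the Interruptible Critical Sections mechanism itself; it is a single-threaded model in which a hypothetical `VersionedCell` carries a version counter, and a hypothetical `interfere` callback stands in for a conflicting process. In a real system the commit step would have to be atomic.

```python
class VersionedCell:
    """Shared datum with a version counter used to detect conflicts (hypothetical)."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def try_commit(self, seen_version, new_value):
        # Commit only if nobody has committed since we read the version.
        # (Assumed atomic in a real system; here the model is single-threaded.)
        if self.version != seen_version:
            return False
        self.value = new_value
        self.version += 1
        return True


def optimistic_update(cell, compute, interfere=None):
    """Run compute() optimistically; re-execute the whole section on conflict."""
    attempts = 0
    while True:
        attempts += 1
        seen, snapshot = cell.version, cell.value   # start of the critical section
        new_value = compute(snapshot)
        if interfere is not None:
            interfere(cell, attempts)               # simulated conflicting process
        if cell.try_commit(seen, new_value):
            return attempts                         # no conflict detected


def interfere_once(cell, attempt):
    # A conflicting update that happens only during the first attempt.
    if attempt == 1:
        cell.try_commit(cell.version, cell.value + 100)


cell = VersionedCell(10)
attempts = optimistic_update(cell, lambda v: v + 1, interfere_once)
print(attempts, cell.value)
```

The first attempt is discarded because the version changed underneath it; the second attempt recomputes from the fresh value and commits.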

Only a small amount of research has been done on optimistic algorithms, and most
of it targets uniprocessors [5, 11, 73]. Processor architectures should have
suitable instructions to support these algorithms [36, 42, 90], and much research
remains to be done to formulate and evaluate the support needed [2, 10]. These
mechanisms also need to be extended to multiprocessors, which again raises the
question of what hardware support is needed to implement them efficiently. We
present a discussion of optimistic synchronization algorithms and related issues
in Chapter 4.

1.4    Dissertation  Structure 

The rest of the dissertation is organized as follows. In Chapter 2, we present
a survey of the research in related areas, namely, real-time scheduling, priority
inheritance protocols, and synchronization mechanisms in multiprocessors. More
importantly, we show the effect on scheduling of not using priority synchronization
algorithms. In Chapter 3, we describe the current results achieved on designing
priority lock mechanisms based on pessimistic synchronization methods. Chapter 4
describes Interruptible Critical Sections that rely on optimistic synchronization. We
also present an implementation of Interruptible Critical Sections for uniprocessor
systems. Interruptible Critical Sections are extended to multiprocessors in Chapter 5.
We summarize our research in Chapter 6.

1.5 Conclusion


Supporting real-time systems on uniprocessors and multiprocessors is of growing
importance. We argue that priority synchronization algorithms are necessary for
this purpose, and we demonstrate the viability of designing them. The design effort
is backed by sound theoretical and analytical reasoning, and by practical
implementation results.


CHAPTER  2 
RELATED  RESEARCH 


2.1  Introduction 

This  chapter  introduces  previous  research  areas  that  are  affected  by  prioritizing 
synchronization.  Priority  synchronization  algorithms  improve  the  schedulability  of  a 
set  of  processes  in  a  real-time  system.  If  priority  is  disregarded  for  synchronization, 
a  lower  priority  process  may  block  a  higher  priority  process.  In  this  context,  we 
will  study  what  schedulability  is  and  how  it  is  affected  by  blocking.  We  also  present 
Priority  Inheritance  and  Priority  Ceiling  protocols  that  reduce  the  effect  of  blocking. 
We classify synchronization mechanisms based on their characteristics. Finally, we
present various relevant synchronization mechanisms that are currently in use.

2.2    Real  Time  Scheduling 

2.2.1  Introduction 

A real-time system is one that executes processes with time constraints [18], where
a process (or a task) is defined as an individually schedulable entity. A hard real-time
system is a real-time system in which every process's deadline is critical, i.e., the
system fails if any one of its processes does not meet its deadline. In a soft
real-time system, processes tolerate any possible delay in completion past their
deadline. Usually, there is a penalty associated with the delay in completion, and the
longer the delay, the greater the penalty. Therefore, the goal of a hard real-time
system is to avoid any failure state, and that of a soft real-time system is to
minimize the total penalty. In this discussion, we are not concerned with the type of
real-time system, since our proposed algorithms do not distinguish between the two.

The  fundamental  problem  in  a  real-time  system  is  to  schedule  processes  so  as  to 
maximize  the  number  of  processes  that  meet  their  deadlines.  The  complexity  of  the 
problem  depends  on  the  complexity  of  the  process  model  itself.  A  simplistic  model  is 
one  in  which  we  only  know  about  the  process  priority,  which  is  the  importance  of  the 
process  relative  to  all  other  processes.  At  the  other  end  of  the  spectrum  is  a  model 
of  a  process  with  resource  requirements,  concurrency  constraints  and  input/output 
requirements,  in  addition  to  computation  requirements  and  timing  constraints.  A 
process  can  be  periodic  or  nonperiodic.  A  periodic  process  is  one  which  is  invoked 
at  regular  intervals  of  time.  A  nonperiodic  process  has  an  arbitrary  arrival  time 
and  deadline  and  when  invoked,  is  expected  to  execute  just  once.  Processes  are 
also  distinguished  as  preemptable  or  non-preemptable.  A  process  is  preemptable  if  its 
execution  can  be  interrupted  by  other  processes  at  any  time  and  resumed  afterwards. 
A  process  is  non-preemptable  if  it  must  run  to  completion  once  it  starts.  Process 
scheduling  can  be  classified  according  to  the  underlying  process  model. 

Another way to classify the scheduling problem, as well as the system, is by when
information about a process becomes available. If knowledge about a process is
available a priori, then the system is a static real-time system, since all the
scheduling decisions can be made once, during the initialization phase of the system.
Such a system is very inflexible: if a new process has to be accommodated at a future
date, the whole system first has to be shut down. The advantage is that there is no
scheduling overhead at run-time. A dynamic approach determines schedules for
processes on the fly and allows processes to be dynamically invoked. Dynamic
approaches involve higher run-time costs, but they are flexible and can easily adapt
to changes in the environment.

The goal in designing real-time systems is predictability rather than speed. The
primary goal of scheduling is to meet the individual process deadlines, not to
minimize, say, the response time. Although scheduling arises in many other areas,
such as job shop scheduling, real-time process scheduling is different because each
process has a deadline.

Although scheduling is the main issue in real-time systems, there are other
concerns as well. Some of them are provisions to include user-written device drivers,
a guaranteed minimum interrupt response time, and the ability to tailor the system to
specific requirements. In addition, the system should not restrict the usage of
hardware in any undesirable way.

2.2.2    Rate  Monotonic  Scheduling 

The analysis of schedulability will be performed using a theory of real-time
systems that is based on rate monotonic scheduling theory. Rate monotonic
scheduling theory provides analytical mechanisms for understanding and predicting
the execution timing behavior of real-time systems. The basic theory, introduced in a
seminal paper by Liu and Layland [59], gives us a rule for assigning priorities
to periodic processes and a formula for determining whether a set of periodic
processes will all meet their deadlines.

We assume the following notation. We consider N periodic processes
$T_1, T_2, T_3, \ldots, T_N$ on a uniprocessor. Let $E_i$, $D_i$, and $C_i$ represent
the execution time, the deadline, and the cycle time (periodicity) of the process
$T_i$. Assume that the numbering of the processes is such that the following
relationship holds: $C_1 < C_2 < \cdots < C_N$.

The CPU utilization of a process $T_i$ is the ratio of the process's execution time to
its period. The CPU utilization of a set of processes is the sum of the utilizations of
the individual processes.

CPU utilization of a set of processes: $U(N) = E_1/C_1 + E_2/C_2 + \cdots + E_N/C_N$.


The rate monotonic scheduling theory makes the following assumptions.

•  Process  switching  is  instantaneous. 

•  Processes  account  for  all  execution  times. 

•  Process  interactions  are  not  allowed. 

•  Processes  become  ready  to  execute  precisely  at  the  beginning  of  their  periods. 

•  Process  deadlines  are  always  the  start  of  the  next  period. 

The rate monotonic algorithm assigns priorities to processes based on the
processes' cycle times. Processes with shorter periods are assigned higher priorities
and a higher priority process can preempt a lower priority process. The rate monotonic
theorem proves that a set of N independent processes scheduled by the rate monotonic
algorithm will always meet their deadlines, for all process phasings, if

$E_1/C_1 + E_2/C_2 + \cdots + E_N/C_N = U(N) < N(2^{1/N} - 1)$

Basically,  if  the  utilization  of  the  process  set  is  less  than  a  theoretically  determined 
bound,  then  the  set  of  processes  is  guaranteed  to  meet  all  of  its  deadlines. 
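
As an illustration, the utilization test above can be written as a few lines of Python. This is a minimal sketch: the function names and the example task set are hypothetical, each task is given as a pair (E, C) of execution time and cycle time, and the test is a sufficient condition only.

```python
def utilization(tasks):
    """U(N): sum of E_i / C_i over (E_i, C_i) pairs."""
    return sum(e / c for e, c in tasks)


def liu_layland_bound(n):
    """The bound N(2^(1/N) - 1) for n independent periodic processes."""
    return n * (2 ** (1.0 / n) - 1)


def rm_schedulable(tasks):
    """Sufficient test: True if U(N) is below the rate monotonic bound."""
    return utilization(tasks) < liu_layland_bound(len(tasks))


# Hypothetical task set: (execution time, cycle time)
tasks = [(1, 4), (1, 5), (2, 10)]
print(utilization(tasks), liu_layland_bound(len(tasks)), rm_schedulable(tasks))
```

For this task set the utilization is 0.65, below the three-process bound of roughly 0.78, so the test succeeds. A task set that fails the test is not necessarily unschedulable; the bound is only a guarantee in one direction.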

Given a set of N independent periodic processes scheduled by the rate monotonic
algorithm, a particular process, $T_k$, $k < N$, will always meet its deadline if

$E_1/C_1 + E_2/C_2 + \cdots + E_k/C_k = U(k) < k(2^{1/k} - 1)$

From this result, it can be seen that the only factors that determine the
schedulability of process $T_k$ are the utilization of higher priority processes and
the utilization of the process $T_k$ itself.

The discussion so far assumes that processes always execute in accordance with
their rate monotonic priority. But in practice, because of priority inversion, a higher
priority process may be blocked by a lower priority process that is executing a
non-preemptable section of code. This blocking effect can be included in the previous
result as follows.

Let $B_k$ be the worst case total amount of blocking that a process $T_k$ can incur
during any period. It has been shown [80] that all processes will meet their deadlines
if

$E_1/C_1 + B_1/C_1 < 1(2^{1/1} - 1)$

$E_1/C_1 + E_2/C_2 + B_2/C_2 < 2(2^{1/2} - 1)$

$\vdots$

$E_1/C_1 + E_2/C_2 + \cdots + E_k/C_k + B_k/C_k < k(2^{1/k} - 1)$

$\vdots$

$E_1/C_1 + E_2/C_2 + \cdots + E_N/C_N + B_N/C_N < N(2^{1/N} - 1)$

The  inequalities  explicitly  show  how  blocking  affects  the  schedulability  of  a  set  of 
processes  and  why  it  is  desirable  to  minimize  blocking. 

Process synchronization is a common source of blocking. When more than one
process requires mutually exclusive access to a resource, processes must synchronize.
If a lower priority process has locked a resource and is then preempted by a higher
priority process that executes until it needs to access the same resource, the higher
priority process is forced to wait. The higher priority process is blocked.

By  using  priority  synchronization  algorithms,  this  blocking  can  be  reduced  or 
avoided,  in  proportion  to  the  priority  of  the  processes,  which  directly  improves  the 
schedulability  of  a  set  of  processes.  The  blocking  as  a  whole  is  not  reduced;  it  is 
shifted  from  higher  priority  processes  to  lower  priority  processes. 
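
The schedulability inequalities with blocking terms, given earlier in this section, can be checked mechanically. The following Python sketch is illustrative only: the function names and the task values are hypothetical, and each task is a triple (E, C, B) of execution time, cycle time, and worst case blocking, listed in rate monotonic order (increasing C).

```python
def ll_bound(k):
    """The bound k(2^(1/k) - 1)."""
    return k * (2 ** (1.0 / k) - 1)


def schedulable_with_blocking(tasks):
    """Check the k-th inequality for every k = 1..N.

    tasks: list of (E, C, B) triples sorted by increasing cycle time C.
    The k-th test adds the blocking term B_k / C_k of process k only.
    """
    running = 0.0                      # E_1/C_1 + ... + E_k/C_k
    for k, (e, c, b) in enumerate(tasks, start=1):
        running += e / c
        if not (running + b / c < ll_bound(k)):
            return False
    return True


# Hypothetical task sets: (execution time, cycle time, worst case blocking)
light_blocking = [(1, 4, 1), (1, 5, 1), (2, 10, 0)]
heavy_blocking = [(1, 4, 4), (1, 5, 1), (2, 10, 0)]
print(schedulable_with_blocking(light_blocking))
print(schedulable_with_blocking(heavy_blocking))
```

The two task sets differ only in the blocking of the highest priority process, which is enough to make the first inequality fail in the second case. This is why shifting blocking away from high priority processes with short periods matters.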

In section 2.3, we discuss the priority ceiling protocol (PCP) and the priority
inheritance protocol (PIP), a class of protocols that reduce the effects of blocking
and also prevent mutual deadlock.

2.2.3    Static  Scheduling  Algorithms 

•  Static,  Preemptive  Scheduling  on  a  Uniprocessor  for  arbitrary  tasks. 

Horn [40] developed an $O(n^2)$ algorithm for n tasks to be scheduled, based on
the earliest deadline policy: tasks with earlier deadlines and earlier ready times
are chosen to run before tasks with later deadlines and ready times. Tasks can
have arbitrary ready times and deadlines.

•  Static,  Preemptive  Scheduling  on  a  Multiprocessor  for  arbitrary  tasks. 

Horn also described an $O(n^3)$ algorithm to schedule n tasks with arbitrary
ready times and deadlines on a multiprocessor. His approach is based on the
network flow method and considers only processors with the same speeds. This
approach is extended by Martel [64] by considering processors with different
speeds. The complexity of Martel's algorithm is $O(m^2 n^4 + n^5)$, where m is the
number of processors.

•  Static,  Preemptive  Scheduling  on  a  Uniprocessor  for  periodic  tasks. 

The algorithm for scheduling arbitrary tasks can be applied to periodic tasks by
considering the instances of the periodic tasks within the time interval between
zero and the least common multiple of the tasks' periods. Horn's and Martel's
approaches can also be applied to multiprocessor systems in the same way.

The rate monotonic algorithm described earlier assigns priorities to tasks based
on the tasks' cycle times. Tasks with shorter periods are assigned higher
priorities and a higher priority task can preempt a lower priority task. The rate
monotonic theorem [59] proves that a set of N independent tasks scheduled
by the rate monotonic algorithm will always meet their deadlines, for all task
phasings, if

$U(N) \le N(2^{1/N} - 1)$.

Teixeira  [93]  presented  a  fixed-priority  assignment  scheme  in  which  he  assumed 
that  the  relative  deadline  of  a  periodic  task  can  be  different  from  the  period  of 
a  task. 

•  Static,  Preemptive  Scheduling  on  a  Multiprocessor  for  periodic  tasks. 

A partition approach is adopted to solve this problem. The main idea is to
partition a set of periodic tasks among a minimum number of processors such
that each partition of the periodic tasks can be scheduled on one processor
according to the earliest deadline scheme or the rate monotonic priority scheme.
If the earliest deadline scheme is used, a bin-packing algorithm can
be used to determine a suboptimal partition pattern of periodic tasks among
multiple processors [20].

•  Static  Nonpreemptive  Scheduling. 

Nonpreemptive scheduling is more difficult than preemptive scheduling and
many nonpreemptive problems have been shown to be NP-Hard. Scheduling
nonpreemptable tasks with arbitrary ready times is NP-Hard even in
uniprocessor systems. But for some restrictive problems, efficient algorithms are
available. For example, the earliest deadline algorithm has been shown to be optimal
for scheduling a set of tasks with the same ready times [72]. Kise developed an
$O(n^2)$ algorithm for the case in which a task has an earlier ready time if and
only if it has an earlier deadline [49].

For multiprocessor systems, a nonpreemptive scheduling problem is NP-Hard
even when the ready times and deadlines of tasks are the same [98]. A
polynomial optimal algorithm is available for scheduling tasks with unit computation
time [86].
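
The partition approach for periodic tasks on a multiprocessor, mentioned in the list above, can be sketched with a simple bin-packing heuristic. The sketch below uses first-fit decreasing, which is only one possible choice and is not necessarily the algorithm of [20]. It assumes the earliest deadline scheme with deadlines equal to periods, under which a processor can be loaded up to a utilization of 1. The function name and the task values are hypothetical.

```python
def first_fit_partition(tasks, capacity=1.0):
    """Assign (E, C) tasks to processors by first-fit decreasing utilization.

    Returns a list of partitions, one list of tasks per processor.
    capacity is the per-processor utilization limit (1.0 assumed for
    earliest deadline scheduling with deadline equal to period).
    """
    processors = []                    # each entry: [utilization, [tasks]]
    ordered = sorted(tasks, key=lambda t: t[0] / t[1], reverse=True)
    for e, c in ordered:
        u = e / c
        for proc in processors:
            if proc[0] + u <= capacity + 1e-12:   # tolerance for float rounding
                proc[0] += u
                proc[1].append((e, c))
                break
        else:
            # No existing processor has room: open a new one.
            processors.append([u, [(e, c)]])
    return [p[1] for p in processors]


# Hypothetical task set: (execution time, cycle time)
parts = first_fit_partition([(1, 2), (1, 2), (1, 4), (3, 4)])
print(len(parts), parts)
```

The heuristic is suboptimal in general, which matches the remark above that bin-packing yields a suboptimal partition pattern.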

2.2.4    Dynamic  Scheduling  Algorithms 

Most algorithms that are optimal for static scheduling are not so for dynamic
scheduling. For multiprocessors, there can be no optimal algorithm for scheduling
preemptable processes if the arrival times of the processes are not known in advance
[69]. Since run-time cost is an important factor for dynamic scheduling, most static
algorithms are not suitable for dynamic scheduling. Hence, there is a need to develop
heuristic algorithms for dynamic scheduling.

For uniprocessor systems, it was shown that the earliest deadline algorithm is
optimal for scheduling preemptable processes with arbitrary arrival times [21].
Stankovic et al. [87] describe an algorithm that is based on the earliest deadline
policy but takes into account the run-time cost. Baker and Su [7] compared four
heuristic algorithms that schedule processes according to an order determined by
ready time, by deadline, by the average of ready time and deadline, and by both ready
time and deadline. They showed that the last two algorithms perform better than the
first two.

In multiprocessor systems, Mok et al. [69] have shown that if a set of all possible
processes that will ever arrive in a system can be scheduled ahead of time, then the
set of processes can also be scheduled at run-time. The drawback of this approach is
that the probability of all possible arriving processes being scheduled ahead of time
is very low. They have also proved that one successful run-time algorithm is the least
laxity algorithm. Locke et al. [60] found that the least laxity first and the earliest
deadline first are two good heuristic policies.
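
The two heuristic policies named above differ only in the key used to pick the next process. The following Python sketch shows the selection step of each. It uses the standard definition of laxity (time to deadline minus remaining execution time), which is not spelled out in the text, and the record fields and values are hypothetical.

```python
def laxity(proc, now):
    """Slack left before the process can no longer meet its deadline."""
    return proc["deadline"] - now - proc["remaining"]


def pick_edf(ready):
    """Earliest deadline first: choose the process with the nearest deadline."""
    return min(ready, key=lambda p: p["deadline"])


def pick_llf(ready, now):
    """Least laxity first: choose the process with the smallest laxity."""
    return min(ready, key=lambda p: laxity(p, now))


# Hypothetical ready queue at time 0
ready = [
    {"name": "A", "deadline": 10, "remaining": 2},
    {"name": "B", "deadline": 12, "remaining": 9},
]
print(pick_edf(ready)["name"], pick_llf(ready, now=0)["name"])
```

Here process A has the earlier deadline, but process B has far less slack (3 time units against 8), so the two policies choose differently.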

2.3    Priority  Inheritance  Protocols 

In  the  discussion  on  rate  monotonic  scheduling  (section  2.2.2),  we  have  seen  how 
blocking  affects  the  schedulability  of  processes.  Here,  we  give  a  brief  discussion  of 
the  Priority  Inheritance  and  Priority  Ceiling  protocols  that  reduce  blocking  [80]. 

2.3.1    Priority  Inheritance  Protocol 

Priority inheritance prevents a medium priority process from prolonging the
actual amount of time that a resource is locked by a lower priority process. Without
the inheritance, a medium priority process can preempt the lower priority critical
section and prolong the period of blocking of a higher priority process. To avoid this
situation, priority inheritance allows the lower priority process to inherit the blocked
process's higher priority for the duration of the critical section. Thus, priority
inheritance prevents the medium priority process from preempting the critical section,
which is now executing at a high priority.

However,  this  basic  priority  inheritance  mechanism  has  a  drawback,  in  that,  if 
a  process  shares  m  resources  with  lower  priority  processes,  then  it  can  be  blocked 
at  most  m  times  per  execution  period  due  to  process  synchronization.  This  can  be 
illustrated  by  an  example.  Suppose  a  high  priority  process  requires  data  from  several 
resources  that  are  all  currently  locked.  The  low  priority  process  locks  a  resource;  it 
is  then  preempted  by  a  slightly  higher  priority  process  that  in  turn  locks  another 
resource,  and  so  on.  The  blocking  process  inherits  the  blocked  process's  priority  and, 
after  the  critical  section  is  completed,  will  relinquish  the  resource.  The  high  priority 
process  will  use  the  resource  and  then  be  forced  to  wait  for  the  second  resource  that 
it  needs  and  so  on.  Thus,  the  high  priority  process  can  wait  at  most  m  times  for  the 
m  resources  that  it  needs. 
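
The basic inheritance rule can be modeled in a few lines of Python. This is a simplified sketch rather than the protocol of [80]: it handles a single lock, does no scheduling, and resets the owner to its base priority on release. The class names are hypothetical, and a larger number means a higher priority.

```python
class Task:
    def __init__(self, name, priority):
        self.name = name
        self.base_priority = priority
        self.priority = priority          # effective priority


class InheritanceLock:
    """Single lock whose owner inherits the priority of blocked tasks."""

    def __init__(self):
        self.owner = None
        self.waiters = []

    def acquire(self, task):
        if self.owner is None:
            self.owner = task
            return True                   # lock granted
        # Blocked: the owner inherits the blocked task's priority if higher.
        self.waiters.append(task)
        if task.priority > self.owner.priority:
            self.owner.priority = task.priority
        return False

    def release(self):
        # The owner returns to its base priority (single-lock simplification).
        self.owner.priority = self.owner.base_priority
        self.owner = None
        if self.waiters:
            # Hand the lock to the highest priority waiter.
            nxt = max(self.waiters, key=lambda t: t.priority)
            self.waiters.remove(nxt)
            self.owner = nxt
        return self.owner


low, high = Task("low", 1), Task("high", 3)
lock = InheritanceLock()
lock.acquire(low)
lock.acquire(high)                        # high blocks; low now runs at priority 3
print(low.priority)
```

While the low priority task holds the lock at the inherited priority, a medium priority task (priority 2) could not preempt it, which is exactly the effect described above.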

2.3.2    Priority  Ceiling  Protocol 

This  protocol  solves  the  above  problem,  in  that  a  high  priority  process  waiting 
for  m  shared  resources  waits  at  most  once  per  period,  for  the  duration  of  the  entire 
critical  section.  In  this  mechanism,  associated  with  each  resource,  in  addition  to  the 
semaphore  or  monitor  that  protects  it,  is  an  attribute,  known  as  the  priority  ceiling. 
This  is  the  highest  priority  at  which  a  critical  section  associated  with  that  resource 
can  be  executed,  i.e.,  the  highest  priority  of  a  process  that  can  use  the  resource.  Thus, 
if  a  high  priority  process  wishes  to  use  m  resources,  the  ceiling  of  those  resources  is 
set to the priority of this process. So, a medium priority process can never preempt
a  lower  priority  process  in  its  critical  section.  Thus,  this  rule  allows  only  one  process 
to  have  locks  at  any  given  time.  Hence,  the  high  priority  process  is  blocked  at  most 
once  per  period  of  execution. 
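
A small Python sketch can make the ceiling attribute concrete. The ceiling computation follows the definition above. The locking test is the rule usually stated for the Priority Ceiling Protocol in the literature (a task may lock a resource only if its priority is strictly higher than the ceilings of all resources currently locked by other tasks); it is an assumption here, since the text does not spell it out. Names and priorities are hypothetical, and a larger number means a higher priority.

```python
def priority_ceilings(users):
    """Ceiling of each resource: highest priority among the tasks that use it.

    users: dict mapping resource name -> list of priorities of its users.
    """
    return {res: max(prios) for res, prios in users.items()}


def may_lock(task_priority, locked_by_others, ceilings):
    """Assumed PCP locking test against resources held by other tasks."""
    return all(task_priority > ceilings[res] for res in locked_by_others)


# Hypothetical system: a high priority task (3) uses both R1 and R2.
users = {"R1": [1, 3], "R2": [2, 3]}
ceilings = priority_ceilings(users)

# The low priority task (1) holds R1. Can the medium task (2) now lock R2?
print(ceilings, may_lock(2, ["R1"], ceilings))
```

Because the medium priority task is refused R2 while R1 is held, the high priority task never finds both of its resources locked by different tasks, so it waits for at most one critical section.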

2.3.3  Priority  Ceiling  Protocols  with  Abortable  Critical  Sections 

Priority  inversion  of  high  priority  tasks  is  reduced  further  by  selectively  aborting 
a  low  priority  task  executing  within  a  critical  section.  Shu  et  al.  [84]  proposed 
an  Abort  Ceiling  Protocol,  an  extension  to  the  Priority  Ceiling  Protocol.  In  this 
algorithm,  an  abort  ceiling  priority  is  associated  with  a  task.  The  abort  ceiling  comes 
into  effect  when  the  task  is  executing.  Another  task  may  abort  the  currently  running 
task  and  run  immediately  if  its  priority  is  higher  than  the  current  abort  ceiling.  The 
protocol  relies  on  the  Interruptible  Critical  Sections  (Chapter  4)  to  restart  the  critical 
section  of  the  aborted  task.  Also,  the  protocol  assumes  static  priorities.  The  Ceiling 
Abort Protocol [92] proposed by Takada and Sakamura is a similar extension to the
Priority  Ceiling  Protocol.  This  protocol  assigns  an  abort  ceiling  priority  to  the  critical 
section  instead.  Also,  the  critical  section  is  divided  into  abortable  and  non-abortable 
segments.  The  issue  here  is  to  minimize  abortion  and  re-execution  overheads. 

2.3.4  Priority  Inheritance  Protocols  for  Multiprocessors 

Priority inheritance protocols have been extended for multiprocessors by
Rajkumar et al. [79]. In the case of multiprocessors, the concept of blocking is
generalized to include remote blocking. When a process executing on one processor has
to wait for the execution of a process assigned to another processor, it is said to
experience remote blocking.

By its very nature, remote blocking is very different from uniprocessor blocking:
remote blocking is a function of execution times of processes on other processors even
in the absence of data-sharing. Thus, uniprocessor priority inheritance protocols
must be enhanced for multiprocessors.

In  the  multiprocessor  priority  ceiling  protocol,  tasks  are  assumed  to  be  bound  to 
a  processor.  Static  binding  is  found  to  perform  better  in  static  as  well  as  dynamic 
priority  scheduling  algorithms.  A  critical  section  executed  by  processes  on  different 
processors  is  called  a  global  critical  section  (GCS).  A  processor  which  executes  global 
critical  sections  is  called  a  synchronization  processor  and  processors  which  run  appli- 
cation processes  only  are  called  application  processors.  A  synchronization  processor 
can  also  be  running  application  tasks. 

The  priority  ceiling  of  a  semaphore  S  indicates  the  maximum  priority  at  which 
a  critical  section  guarded  by  this  semaphore  can  execute.  The  priority  ceiling  of  a 
local  semaphore  S  is  defined  to  be  the  priority  of  the  highest  priority  process  that 
may  lock  the  semaphore. 

Let  the  priority  of  the  highest  priority  process  that  accesses  a  global  semaphore 
GS  be  denoted  by  PS.  Then  the  priority  ceiling  of  a  global  semaphore  GS  is  defined 
such  that 

•  The  priority  ceiling  of  GS  is  higher  than  PS. 

•  If  GSi  and  GSj  are  global  semaphores  and  PSi  >  PSj,  then  the  priority  ceiling 
of  GSi  is  greater  than  the  priority  ceiling  of  GSj. 
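
The two ceiling definitions above can be illustrated with a minimal sketch. It assumes that a larger number means a higher priority, and it realizes the global ceiling by adding a hypothetical constant GLOBAL_BASE (assumed to exceed every task priority) to PS. This is only one way to satisfy the two conditions listed above; the function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical offset for global semaphores; assumed to be strictly
 * higher than every task priority in the system. */
#define GLOBAL_BASE 1000

/* Ceiling of a local semaphore: the highest priority among the
 * processes that may lock it (n must be at least 1). */
int local_ceiling(const int *locker_prio, size_t n)
{
    int max = locker_prio[0];
    for (size_t i = 1; i < n; i++)
        if (locker_prio[i] > max)
            max = locker_prio[i];
    return max;
}

/* Ceiling of a global semaphore: PS shifted by GLOBAL_BASE, so the
 * ceiling exceeds PS and preserves the ordering of the PS values. */
int global_ceiling(const int *locker_prio, size_t n)
{
    return GLOBAL_BASE + local_ceiling(locker_prio, n);
}
```

With task priorities below GLOBAL_BASE, every global ceiling is higher than any task priority, and a semaphore with a larger PS receives a larger ceiling.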

The  multiprocessor  priority  ceiling  protocol  that  is  used  in  each  of  the  application 
processors  and  the  synchronization  processors  is  as  follows. 
•  Each application processor runs the priority ceiling protocol on the set of
processes that it runs, and on the set of local semaphores bound to the application
processor.

•  Each  global  critical  section  on  the  synchronization  processor  normally  executes 
at  its  assigned  priority. 

•  The  synchronization  processor  runs  the  priority  ceiling  protocol  on  the  global 
critical  sections,  the  set  of  application  tasks,  and  the  set  of  global  and  local 
semaphores  bound  to  the  synchronization  processor. 

The  multiprocessor  priority  ceiling  protocol  prevents  deadlocks,  and  bounds  the 
blocking  duration  of  each  process  as  a  function  of  the  critical  section  duration  of 
other  tasks. 

2.4    Synchronization  in  Multiprocessors 

Synchronization  primitives  make  programs  easier  to  understand  and  write,  but 
processors  waste  time  when  waiting  for  locks.  Synchronization  primitives  are  used 
in  nearly  every  parallel  program,  and  lessening  synchronization  delays  is  a  major 
goal for efficient parallel program execution. We study synchronization in the
context of both pessimistic and optimistic approaches. We also briefly review
synchronization mechanisms in message-passing multiprocessors.

2.4.1    Pessimistic  Synchronization 


Pessimistic  synchronization  mechanisms  are  overly  restrictive  in  their  approach. 
Even  if  there  is  no  interference  with  other  processes  while  sharing  a  resource,  a  process 
incurs  the  overhead  of  establishing  the  lock  for  itself  before  using  the  shared  resource. 
Mechanisms include hardware operations like Test&Set, Compare&Swap, etc., low
level  primitives  like  spin-locks,  condition  variables,  etc.,  and  high  level  primitives  like 
semaphores,  monitors,  etc.  We  restrict  our  discussion  to  low  level  synchronization 
primitives  only. 

Hardware  Synchronization  Primitives 

Hardware  synchronization  primitives  evolved  primarily  on  shared-memory  mul- 
tiprocessors. Atomic,  sequentially  consistent  loads  and  stores  can  be  used  to  build 
synchronization  primitives.  Hardware  primitives  include  Test&Set,  Fetch&Store, 
Fetch&Add,  and  Compare&Swap.  These  primitives  are  also  called  read-modify-write 
operations. 

Test&Set and reset are a set of basic synchronization primitives [23]. Test&Set
is repeatedly executed to get exclusive access to a lock variable before entering a
critical section. Lock reset is used to exit from the section. Because the processor
repeatedly tests the lock until it is acquired, this may cause excessive network traffic.
In  contrast,  a  suspend-lock  employs  interprocessor  interrupts.  A  processor  waits  for 
an  interrupt  if  its  first  Test&Set  fails. 

In  another  scheme  [47],  a  full/empty  tag  is  associated  with  each  word  in  shared- 
memory.  Less  general  than  read-modify-write,  this  tag  is  tested  before  a  producer- 
consumer  write  or  read  operation.  Only  a  full  word  can  be  read  and  only  an  empty 
word  can  be  written.  When  the  test  succeeds,  the  read  or  write  operation  is  performed 
and  the  value  of  the  tag  is  reversed. 

An  extension  to  the  Test&Set,  the  Test&Test&Set  repeatedly  tests  a  local  copy 
of  the  lock  whenever  the  first  atomic  Test&Set  of  the  global  lock  fails  [82].  The 
local  copies  are  invalidated  when  the  global  lock  is  reset.  Every  waiting  processor 
does another Test&Test&Set operation. Only one process will get the lock. This
scheme  reduces  the  network  traffic  associated  with  the  Test&Set.  Anderson  [4]  found 
that  exponential  backoff  after  the  first  Test&Test&Set  failure  is  effective  in  reducing 
contention  among  processes  while  acquiring  a  lock. 
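
As a minimal sketch of the Test&Test&Set idea with exponential backoff, the following C11 code reads the lock word (normally served from the local cache) before attempting the atomic exchange. The type and function names, the MAX_DELAY cap, and the empty delay loop are assumptions made for illustration. Unlike the scheme described above, this sketch backs off after every failed attempt rather than only after the first.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_DELAY 1024u   /* hypothetical cap on the backoff delay */

typedef struct { atomic_bool held; } ttas_lock;

void ttas_init(ttas_lock *l) { atomic_init(&l->held, false); }

/* One attempt: plain read first, atomic Test&Set only if it looked free. */
bool ttas_try_acquire(ttas_lock *l)
{
    if (atomic_load_explicit(&l->held, memory_order_relaxed))
        return false;                      /* lock is visibly held */
    /* atomic_exchange returns the previous value: false means we won. */
    return !atomic_exchange_explicit(&l->held, true, memory_order_acquire);
}

void ttas_acquire(ttas_lock *l)
{
    unsigned delay = 1;
    while (!ttas_try_acquire(l)) {
        for (volatile unsigned i = 0; i < delay; i++) { } /* back off */
        if (delay < MAX_DELAY)
            delay *= 2;                    /* exponential growth */
    }
}

void ttas_release(ttas_lock *l)
{
    atomic_store_explicit(&l->held, false, memory_order_release);
}
```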

Fetch&Op primitives include Fetch&Store (swap) and Fetch&Add [52]. The latter
primitive provides for adding an increment to a shared sum.

Compare&Swap  [52]  compares  the  contents  of  a  memory  location  against  a  given 
value,  and  sets  a  condition  code  to  indicate  whether  they  are  equal.  If  so,  it  replaces 
the  contents  of  the  memory  with  a  second  given  value.  Herlihy  [34]  showed  that 
the  Compare&Swap  operation  is  more  powerful  than  the  rest  of  the  operations  listed. 
Herlihy  showed  that  Compare&Swap  can  be  used  to  convert  any  sequential  data  object 
into a concurrent wait-free (Section 2.4.2) data object.

Spin  Locks 

Spin locks are busy-wait constructs in which processes repeatedly test shared
variables to determine when they may proceed. Busy-waiting is preferred over
scheduler-based  blocking  when  scheduler  overhead  exceeds  expected  waiting  time, 
when  processor  resources  are  not  needed  for  other  processes,  or  when  scheduler-based 
blocking  is  inappropriate  or  impossible,  for  example,  in  the  kernel  of  an  operating 
system. 

The  simplest  mutual  exclusion  lock  employs  a  polling  loop  to  access  a  boolean 
flag  that  indicates  whether  the  lock  is  free.  Each  processor  repeatedly  executes  a 
Test&Set  instruction  in  an  attempt  to  change  the  flag  from  false  to  true,  thereby 
acquiring  the  lock.  A  processor  releases  the  lock  by  setting  it  to  false.  To  reduce 
network  traffic,  Test&Test&Set  and  exponential  backoff  may  be  employed. 
A ticket lock [81] reduces the number of atomic operations to one per lock acquisition. A
ticket  lock  consists  of  two  counters,  one  containing  the  number  of  requests  to  acquire 
the  lock,  and  the  other  the  number  of  times  the  lock  has  been  released.  A  processor 
acquires  the  lock  by  performing  a  Fetch&Increment  operation  on  the  request  counter 
and  waiting  until  the  result  is  equal  to  the  value  of  the  release  counter.  It  releases  the 
lock  by  incrementing  the  release  counter.  Processes  acquire  the  lock  in  FIFO  order 
of  their  requests. 
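
The ticket lock described above can be sketched in C11 as follows. The type and function names are illustrative, and returning the ticket number from the acquire operation is an addition made only so that the FIFO order is easy to observe.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
    atomic_uint requests;  /* number of acquire requests issued so far */
    atomic_uint releases;  /* number of times the lock has been released */
} ticket_lock;

void ticket_init(ticket_lock *l)
{
    atomic_init(&l->requests, 0);
    atomic_init(&l->releases, 0);
}

/* Take a ticket with a single Fetch&Increment, then poll the release
 * counter until it reaches the ticket value. */
unsigned ticket_acquire(ticket_lock *l)
{
    unsigned my = atomic_fetch_add_explicit(&l->requests, 1,
                                            memory_order_relaxed);
    while (atomic_load_explicit(&l->releases, memory_order_acquire) != my) {
        /* spin on the shared release counter */
    }
    return my;
}

/* Release by advancing the release counter to the next ticket. */
void ticket_release(ticket_lock *l)
{
    atomic_fetch_add_explicit(&l->releases, 1, memory_order_release);
}
```

All waiters poll the same release counter, which is the source of the contention discussed next.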

The ticket lock can still cause substantial memory and network contention through
polling of a common location. Thus, it is not possible to obtain a lock with an
expected constant number of network transactions, due to the unpredictability of the
length of the critical sections. Anderson [4] and Graunke and Thakkar [30] have proposed
locking algorithms that achieve a constant bound on the number of remote memory
operations in cache-coherent multiprocessors. Each processor uses an atomic
operation to obtain the address of a location on which to spin. Each processor spins on
a different location in a different cache line. These array-based queuing locks
guarantee FIFO ordering of requests and require space per lock linear in the number of
processors.

Mellor-Crummey  and  Scott  [66]  devised  a  list-based  queuing  lock,  the  MCS-Lock. 
It  requires  atomic  Fetch&Store  and  Compare&Swap  instructions.  The  lock  variable 
maintains  the  tail  of  a  FIFO  queue  and  the  head  of  the  queue  is  maintained  by  the 
process  using  the  lock.  The  acquire  operation  is  accomplished  with  a  Fetch&Store 
operation on the lock variable and the release by a Compare&Swap. This lock
guarantees FIFO ordering of lock acquisitions, lets processes spin on locally accessible
flag variables only, and requires a constant amount of space per lock. Markatos [63]
designed a priority spin-lock algorithm based on the MCS-Lock. Craig [19] refined
Markatos's approach to achieve FIFO and priority locks with better space complexities
in case of nested lock acquisitions. In Chapter 3, we discuss our own implementation of a
priority spin-lock, the PR-Lock. Our PR-Lock has better acquire and release lock
characteristics, and differs from the above cited locks in some important details.

Condition  Variables 

Condition variables [85] allow conditional blocking inside a critical section.
Condition variables can be used to implement monitors and are provided in the Mach [1]
operating system. The operations on a condition variable include condition-wait
and condition-signal. A condition variable is associated with a mutex variable.

When  a  process  performs  the  condition-wait  operation  on  a  condition  variable, 
the  associated  mutex  variable  is  unlocked  and  the  calling  process  is  blocked.  When 
another  process  executes  the  condition-signal  operation  on  the  same  condition  vari- 
able indicating  that  the  condition  may  have  changed,  the  associated  mutex  variable 
is  locked  and  the  blocked  process  continues.  The  unblocked  process  must  re-evaluate 
the  condition  before  proceeding  further. 

2.4.2    Optimistic  Synchronization 

In  these  types  of  synchronization  mechanisms,  processes  use  a  shared  resource 
with  the  optimism  of  non-interference.  On  the  other  hand,  care  has  to  be  taken  if 
there  is  a  conflict.  In  Interruptible  Critical  Sections,  the  affected  process  recovers  and 
restarts the critical section from the beginning [11]. This is the subject of discussion
in Chapter 4.

Lock-free  synchronization  was  introduced  by  Lamport  [53].  Lock-free  data  struc- 
tures can  be  further  classified  as  non-blocking  and  wait-free  [34].  Nonblocking  algo- 
rithms guarantee  that  some  process  accessing  the  data  structure  will  complete  an 
operation in a finite number of steps. Wait-free algorithms ensure that all processes
complete  their  operation  within  a  fixed  number  of  steps. 

Herlihy  [35]  has  shown  that  it  is  impossible  to  construct  non-blocking  implemen- 
tations of  arbitrary  concurrent  objects  with  any  combination  of  read,  write,  and 
Fetch&Op (where Op can be Store, Increment, Add, etc.) when the number of
processes being considered is greater than two. However, there are some universal
atomic operations which are capable of implementing arbitrary non-blocking objects,
Compare&Swap being one. Methods for automatically converting a sequential
implementation of an abstract data type into a wait-free implementation were given by
Herlihy, and into a non-blocking implementation by Prakash et al. [78], and Turek
et al. [97]. The methodology proposed by Turek et al. also handles wait-free
implementation, uses less memory and accommodates greater concurrency.

2.4.3    Synchronization through Message Passing

Another  way  processes  can  communicate  and  synchronize  is  through  message 
passing.  Message  passing  is  a  form  of  synchronization,  since  a  message  can  be  received 
only  after  it  has  been  sent.  Remote  Procedure  Calls  and  Rendezvous  are  higher  level 
forms  of  synchronization  using  message  passing.  In  the  context  of  real-time  systems, 
Goscinski  [29]  developed  two  algorithms  for  mutual  exclusion  in  distributed  systems. 
Johnson and Newman-Wolfe [46] proposed a distributed priority lock based on the
PR-Lock (Chapter 3) that has low storage and overhead requirements.

2.5  Conclusion 


In  this  chapter  we  introduced  some  of  the  issues  in  real-time  systems  including 
scheduling,  priority  inversion,  and  synchronization.  We  presented  some  of  the  current 
practices in scheduling processes on uniprocessors as well as multiprocessors. We presented
an  analysis  of  the  rate-monotonic  scheduling  algorithm  and  the  effect  of  blocking  due 
to  priority  inversion.  We  also  discussed  two  protocols  for  reducing  the  effect  of 
blocking,  namely,  priority  inheritance  and  priority  ceiling. 

We  categorized  synchronization  into  two  types:  pessimistic  and  optimistic.  We 
cited  many  examples  and  techniques  illustrating  the  two  mechanisms.  Not  all  of 
the  specific  mechanisms  are  suitable  under  a  given  environment:  some  can  be  more 
efficient  than  others. 

Optimistic synchronization algorithms were not previously applied to real-time
systems. We show how optimistic synchronization can be effectively used for
real-time systems, thereby avoiding the priority inversion problem that is inherent in the
lock-based synchronization mechanisms.


CHAPTER  3 

A  PRIORITIZED  MULTIPROCESSOR  SPIN  LOCK 


3.1  Introduction 

Mutual  exclusion  is  a  fundamental  synchronization  problem  for  exclusive  access 
to  critical  sections  or  shared  resources  on  multiprocessors  [62].  The  spin-lock  is 
one  of  the  mechanisms  that  can  be  used  to  provide  mutual  exclusion  on  shared- 
memory  multiprocessors  [6].  A  spin-lock  usually  is  implemented  using  atomic  read- 
modify-write  instructions  such  as  Test&Set  or  Compare&Swap,  which  are  available  on 
most  shared-memory  multiprocessors  [52].  Busy  waiting  is  effective  when  the  critical 
section  is  small  and  the  processor  resources  are  not  needed  by  other  processes  in  the 
interim.  However,  a  spin-lock  is  usually  not  fair,  and  a  naive  implementation  can 
severely  limit  performance  due  to  network  and  memory  contention  [4,  27].  A  careful 
design  can  avoid  contention  by  requiring  processes  to  spin  on  locally  stored  or  cached 
variables  [66]. 

In  real-time  systems,  each  process  has  timing  constraints  and  is  associated  with  a 
priority  indicating  the  urgency  of  that  process  [88].  This  priority  is  used  by  the  oper- 
ating system  to  order  the  rendering  of  services  among  competing  processes.  Normally, 
the  higher  the  priority  of  a  process,  the  faster  its  request  for  services  gets  honored. 
When  the  synchronization  primitives  disregard  the  priorities,  lower  priority  processes 
may  block  the  execution  of  a  process  with  a  higher  priority  and  a  stricter  timing  con- 
straint [79,  80].  This  priority  inversion  may  cause  the  higher  priority  process  to  miss 
its  deadline,  leading  to  a  failure  of  the  real-time  system.  Most  of  the  work  done  in 
synchronization is not based on priorities, and thus is not suitable for real-time
systems. Furthermore, general purpose parallel processing systems often have processes
that  are  "more  important"  than  others  (kernel  processes,  processes  that  hold  many 
locks,  etc.).  The  performance  of  such  systems  will  benefit  from  prioritized  access  to 
critical  sections. 

In  this  chapter,  we  present  a  prioritized  spin-lock  algorithm,  the  PR-Lock.  The 
PR-Lock algorithm is suitable for use in systems which either use static-priority
schedulers, or use dynamic-priority schedulers in which the relative priorities of existing
tasks  do  not  change  while  blocked  (such  as  Earliest  Deadline  First  [88]  or  Minimum 
Laxity  [39]).  The  PR-Lock  is  a  contention-free  lock  [66],  so  its  use  will  not  create 
excessive network or memory contention. The PR-Lock maintains a queue of records,
with  one  record  for  each  process  that  has  requested  but  not  yet  released  the  lock. 
The  queue  is  maintained  in  sorted  order  (except  for  the  head  record)  by  the  acquire 
lock  operations,  and  the  release  lock  operation  is  performed  in  constant  time.  As 
a  result,  the  queue  order  is  maintained  by  processes  that  are  blocked  anyway,  and 
a  high  priority  task  does  not  perform  work  for  a  low  priority  task  when  it  releases 
the  lock.  The  lock  keeps  a  pointer  to  the  record  of  the  lock  holder,  which  aids  in 
the  implementation  of  priority  inheritance  protocols  [79,  80].  A  task's  lock  request 
and  release  are  performed  at  well-defined  points  in  time,  which  makes  the  lock  pre- 
dictable. We  present  a  correctness  proof,  and  simulation  results  which  demonstrate 
the  prioritized  lock  access,  the  locality  of  the  references,  and  the  improvement  over 
a  previously  proposed  prioritized  spin-lock. 

We  organize  this  chapter  as  follows.  In  Section  3.1.1  we  describe  previous  work 
in  this  area  and  in  Section  3.2,  we  present  our  algorithm.  In  Section  3.3  we  argue 
the  correctness  of  our  algorithm.  In  Section  3.4  we  discuss  an  extension  to  the  algo- 
rithm presented  in  Section  3.2.  In  Section  3.5  we  show  the  simulation  results  which 
compare the performance of the PR-Lock against that of other similar algorithms.
In  Section  3.6  we  conclude  this  chapter  by  suggesting  some  applications  and  future 
extensions  to  the  PR-Lock  algorithm. 

3.1.1    Previous  Work 

Our  PR-Lock  algorithm  is  based  on  the  MCS-Lock  algorithm,  which  is  a  spin-lock 
mutual  exclusion  algorithm  for  shared-memory  multiprocessors  [66].  The  MCS-Lock 
grants  lock  requests  in  FIFO  order,  and  blocked  processes  spin  on  locally  accessible 
flag  variables  only,  avoiding  the  contention  usually  associated  with  busy-waiting  in 
multiprocessors  [4,  27].  Each  process  has  a  record  that  represents  its  place  in  the 
lock  queue.  The  MCS-Lock  algorithm  maintains  a  pointer  to  the  tail  of  the  lock 
queue.  A  process  adds  itself  to  the  queue  by  swapping  the  current  contents  of  the 
tail  pointer  for  the  address  of  its  record.  If  the  previous  tail  was  nil,  the  process 
acquired  the  lock.  Otherwise,  the  process  inserts  a  pointer  to  its  record  in  the  record 
of  the  previous  tail,  and  spins  on  a  flag  in  its  record.  The  head  of  the  queue  is  the 
record of the lock holder. The lock holder releases the lock by resetting the flag of its
successor  record.  If  no  successor  exists,  the  lock  holder  sets  the  tail  pointer  to  nil 
using  a  Compare&Swap  instruction. 
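
The MCS-Lock operations described above can be sketched with C11 atomics as follows. This is an illustrative rendering rather than the code of [66]: the type and function names and the memory orderings are assumptions. Each process passes its own queue record to both operations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;  /* successor record in the queue */
    atomic_bool locked;               /* flag the owner spins on */
} mcs_node;

typedef struct { _Atomic(mcs_node *) tail; } mcs_lock;

void mcs_init(mcs_lock *l) { atomic_init(&l->tail, NULL); }

void mcs_acquire(mcs_lock *l, mcs_node *me)
{
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);
    /* Fetch&Store: swap our record in as the new tail. */
    mcs_node *prev = atomic_exchange_explicit(&l->tail, me,
                                              memory_order_acq_rel);
    if (prev == NULL)
        return;                       /* queue was empty: lock acquired */
    /* Link behind the previous tail, then spin on our own flag. */
    atomic_store_explicit(&prev->next, me, memory_order_release);
    while (atomic_load_explicit(&me->locked, memory_order_acquire)) { }
}

void mcs_release(mcs_lock *l, mcs_node *me)
{
    mcs_node *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        /* No visible successor: try to swing the tail back to nil. */
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong_explicit(
                &l->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* A successor swapped itself in; wait until it links to us. */
        while ((succ = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL) { }
    }
    /* Hand over the lock by resetting the successor's flag. */
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}
```

The release path contains one step the prose leaves implicit: if the Compare&Swap fails, a new process has already swapped itself into the tail, so the lock holder waits for that process to fill in the next pointer before resetting its flag.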

Molesky,  Shen,  and  Zlokapa  [71]  describe  a  prioritized  spin-lock  that  uses  the 
test-and-set instruction. Their algorithm is based on Burns's fair test-and-set mutual
exclusion  algorithm  [14].  However,  this  lock  is  not  contention-free. 

Markatos and LeBlanc [63] present a prioritized spin-lock algorithm based on
the  MCS-Lock  algorithm.  Their  acquire  lock  algorithm  is  almost  the  same  as  the 
MCS  acquire  lock  algorithm,  with  the  exception  that  Markatos'  algorithm  maintains 
a  doubly  linked  list.  When  the  lock  holder  releases  the  lock,  it  searches  for  the 
highest  priority  process  in  the  queue.  This  process'  record  is  moved  to  the  head 
of the queue, and its flag is reset. However, the point at which a task requests or
releases  a  lock  is  not  well  defined,  and  the  lock  holder  might  release  the  lock  to  a  low 
priority  task  even  though  a  higher  priority  task  has  entered  the  queue.  In  addition, 
the  work  of  maintaining  the  priority  queue  is  performed  when  a  lock  is  released.  This 
choice  makes  the  time  to  release  a  lock  unpredictable,  and  significantly  increases  the 
time to acquire or release a lock (as is shown in Section 3.5). Craig [19] proposes a
modification to the MCS lock and to Markatos' lock that substitutes an atomic Swap
for the Compare&Swap instruction, and permits nested locks using only one lock
record per process. Takada and Sakamura [91] extended queuing spin-locks
to be preemptable for servicing interrupts.

Goscinski  [29]  developed  two  algorithms  for  mutual  exclusion  for  real-time  dis- 
tributed systems.  The  algorithms  are  based  on  token  passing.  A  process  requests 
the  critical  section  by  broadcasting  its  intention  to  all  other  processes  in  the  system. 
One  algorithm  grants  the  token  based  on  the  priorities  of  the  processes,  whereas  the 
other  algorithm  grants  the  token  to  processes  based  on  the  remaining  time  to  run  the 
processes.  The  holder  of  the  token  enters  the  critical  section. 

The utility of prioritized locks is demonstrated by rate monotonic scheduling
theory [59, 80]. Suppose there are N periodic processes $T_1, T_2, T_3, \ldots, T_N$ on a
uniprocessor. Let $E_i$ and $C_i$ represent the execution time and the cycle time (periodicity)
of the process $T_i$. We assume that $C_1 < C_2 < \cdots < C_N$. Under the assumption that
there is no blocking, [59] show that if for each $j$

\[ \sum_{i=1}^{j} \frac{E_i}{C_i} \le j\left(2^{1/j} - 1\right) \]

then all processes can meet their deadlines.
Suppose that $B_j$ is the worst case blocking time that process $T_j$ will incur. Then
[80] show that all tasks can meet their deadlines if, for each $j$,

\[ \sum_{i=1}^{j} \frac{E_i}{C_i} + \frac{B_j}{C_j} \le j\left(2^{1/j} - 1\right) \]

Thus,  the  blocking  of  a  high  priority  process  by  a  lower  priority  process  has  a 
significant  impact  on  the  ability  of  tasks  to  meet  their  deadlines.  Much  work  has 
been  done  to  bound  the  blocking  due  to  lower  priority  processes.  For  example,  the 
Priority  Ceiling  protocol  [80]  guarantees  that  a  high  priority  process  is  blocked  by  a 
lower  priority  process  for  the  duration  of  at  most  one  critical  section.  The  Priority 
Ceiling  protocol  has  been  extended  to  handle  dynamic-priority  schedulers  [16]  and 
multiprocessors  [17,  79]. 
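
To make the effect of blocking concrete, the following sketch checks a task set against a rate-monotonic utilization bound with a blocking term. It assumes the test takes the usual form: for each j, the utilization of the first j tasks plus B_j/C_j must not exceed j(2^(1/j) - 1). The function name and array layout are illustrative. To avoid a math library call, the code uses the equivalent condition (1 + load/j)^j <= 2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* E[i]: execution time, C[i]: cycle time, B[i]: worst-case blocking time.
 * Tasks are assumed sorted by increasing cycle time, so index 0 has the
 * highest rate-monotonic priority. */
bool rm_schedulable(const double *E, const double *C, const double *B,
                    size_t n)
{
    double util = 0.0;                       /* sum of E[i]/C[i] so far */
    for (size_t j = 1; j <= n; j++) {
        util += E[j - 1] / C[j - 1];
        double load = util + B[j - 1] / C[j - 1];
        /* load <= j(2^(1/j) - 1)  <=>  (1 + load/j)^j <= 2 */
        double p = 1.0;
        for (size_t k = 0; k < j; k++)
            p *= 1.0 + load / (double)j;
        if (p > 2.0)
            return false;                    /* bound violated for task j */
    }
    return true;
}
```

For example, two tasks with E = {1, 1} and C = {4, 8} pass the test without blocking, but fail it once the second task can be blocked for 4 time units.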

Our  contribution  over  previous  work  in  developing  prioritized  contention-free  spin- 
locks  ([19]  and  [63])  is  to  more  directly  implement  the  desired  priority  queue.  Our 
algorithm  maintains  a  pointer  to  the  head  of  the  lock  queue,  which  is  the  record  of  the 
lock  holder.  As  a  result,  the  PR-Lock  can  be  used  to  implement  priority  inheritance 
[79,  80].  The  work  of  maintaining  priority  ordering  is  performed  in  the  acquire  lock 
operation,  when  a  task  is  blocked  anyway.  The  time  required  to  release  a  lock  is 
small  and  predictable,  which  reduces  the  length  and  the  variance  of  the  time  spent 
in  the  critical  section.  The  PR-Lock  has  well-defined  points  in  time  in  which  a  task 
joins  the  lock  queue  and  releases  its  lock.  As  a  result,  we  can  guarantee  that  the 
highest  priority  waiting  task  always  receives  the  lock.  Finally,  we  provide  a  proof  of 
correctness. 

3.2    PR-Lock  Algorithm 

Our  PR-Lock  algorithm  is  similar  to  the  MCS-Lock  algorithm  in  that  both  main- 
tain queues  of  blocked  processes  using  the  Compare&Swap  instruction.  However, 
while the MCS-Lock and Markatos' lock maintain a global pointer to the tail of the
queue, the PR-Lock algorithm maintains a global pointer to the head of the queue. In
both  the  MCS-Lock  and  the  Markatos'  lock,  the  processes  are  queued  in  FIFO  order, 
whereas  in  the  PR-Lock,  the  queue  is  maintained  in  priority  order  of  the  processes. 

3.2.1  Assumptions 

We  make  the  following  assumptions  about  the  computing  environment. 

1.  The  underlying  multiprocessor  architecture  supports  an  atomic  Compare&Swap 
instruction.  We  note  that  many  parallel  architectures  support  this  instruction, 
or  a  related  instruction  [35,  74,  99]. 

2.  The multiprocessor has shared memory with coherent caches, or has
locally-stored but globally-accessible shared memory.

3.  Each  processor  has  a  record  to  place  in  the  queue  for  each  lock.  In  a  NUMA 
architecture,  this  record  is  allocated  in  the  local,  but  globally  accessible,  mem- 
ory. This  record  is  not  used  for  any  other  purpose  for  the  lifetime  of  the  queue. 
In  Section  3.4,  we  allow  the  record  to  be  used  among  many  lock  queues. 

4.  The  higher  the  actual  number  assigned  for  priority,  the  higher  the  priority  of  a 
process  (we  can  also  assume  the  opposite). 

5.  The  relative  priorities  of  blocked  processes  do  not  change.  Acceptable  priority 
assignment  algorithms  include  Earliest  Deadline  First  and  Minimum  Laxity. 

It should be noted that each process p_i participating in the synchronization can
be associated with a unique processor P_i. We expect that the queued processes will
not be preempted, though this is not a requirement for correctness.
3.2.2  Implementation

The PR-Lock algorithm consists of two operations. The acquire_lock operation
acquires a designated lock and the release_lock operation releases the lock. Each
process uses the acquire_lock and release_lock operations to synchronize access to
a resource as follows.

    acquire_lock(L, r)
    critical section
    release_lock(L)

The following sub-sections present the required version of Compare&Swap, the
needed data structures, and the acquire_lock and release_lock procedures.

The Compare&Swap

The  PR-Lock  algorithms  make  use  of  the  Compare&Swap  instruction,  the  code 
for  which  is  shown  in  Figure  3.1.  Compare&Swap  is  often  used  on  pointers  to  object 
records,  where  a  record  refers  to  the  physical  memory  space  and  an  object  refers  to  the 
data  within  a  record.  Current  is  a  pointer  to  a  record,  Old  is  a  previously  sampled 
value  of  Current,  and  New  is  a  pointer  to  a  record  that  we  would  like  to  substitute 
for *Old (the record pointed to by Old). We compute the record *New based on the
object in *Old (or decide to perform the swap based on the object in *Old), so we
want to set Current equal to New only if Current still points to the record *Old.
However, even if Current points to *Old, it might point to a different object than
the one originally read. This will occur if *Old is removed from the data structure,
then re-inserted as Current with a new object. This sequence of events cannot be
detected  by  the  Compare&Swap  and  is  known  as  the  A-B-A  problem. 

Following the work of Prakash et al. [77] and Turek et al. [97], we make use of
a double-word Compare&Swap instruction [74] to avoid this problem. A counter is
appended to Current which is treated as a part of Current. Thus Current consists
of two parts: the value part of Current and the counter part of Current. This
counter is incremented every time a modification is made to *Current. Now all
the variables Current, Old, and New are twice their original size. This approach
reduces the probability of occurrence of the A-B-A problem to acceptable levels for
practical applications. If a double-word Compare&Swap is not available, the address
and counter can be packed into 32 bits by restricting the possible address range of
the lock records.

    Procedure CAS(structure pointer *Current, *Old, *New)
    /* Assume CAS operates on double words */
    atomic {
        if (*Current == *Old) {
            *Current = *New;
            return(TRUE);
        }
        else {
            *Old = *Current;
            return(FALSE);
        }
    }

Figure 3.1. CAS used in the PR-Lock Algorithm

We use a version of the Compare&Swap operation in which the current value of
the target location is returned in Old if the Compare&Swap fails. The semantics of
the  Compare&Swap  used  is  given  in  Figure  3.1.  A  version  of  the  Compare&Swap 
instruction  that  returns  only  TRUE  or  FALSE  can  be  used  by  performing  an  additional 
read. 
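
The C11 function atomic_compare_exchange_strong has the same failure behavior as the Compare&Swap of Figure 3.1: when the comparison fails, it writes the current contents of the target into the expected-value argument. The sketch below pairs it with a counter to show how the A-B-A problem is detected. It assumes, purely for illustration, that the value part is a 32-bit record index packed with a 32-bit counter into one 64-bit word; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Pack a 32-bit value part and a 32-bit counter part into one word. */
uint64_t pack(uint32_t value, uint32_t ctr)
{
    return ((uint64_t)ctr << 32) | value;
}
uint32_t value_of(uint64_t w) { return (uint32_t)w; }
uint32_t ctr_of(uint64_t w)   { return (uint32_t)(w >> 32); }

/* Mirrors Figure 3.1: on failure, *old receives the current contents. */
bool cas(_Atomic uint64_t *current, uint64_t *old, uint64_t new_word)
{
    return atomic_compare_exchange_strong(current, old, new_word);
}

/* Replace the value part and increment the counter part, provided the
 * word still equals the previously sampled *old. */
bool swap_value(_Atomic uint64_t *current, uint64_t *old, uint32_t new_value)
{
    return cas(current, old, pack(new_value, ctr_of(*old) + 1));
}
```

If the value part changes from 7 to 9 and back to 7, a process still holding the original sample fails its Compare&Swap, because the counter part has advanced in the meantime.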
Data Structures

The  basic  data  structure  used  in  the  PR-Lock  algorithm  is  a  priority  queue.  The 
lock  L  contains  a  pointer  to  the  first  record  of  the  queue.  The  first  record  of  the 
queue  belongs  to  the  process  currently  using  the  lock.  If  there  is  no  such  process, 
then  L  contains  nil. 

Each process has a locally-stored but globally-accessible record to insert into the
lock queue. If process p inserts record q into the queue, we say that q is p's record and
p is q's process. The record contains the process priority, the next-record pointer, a
boolean flag Locked on which the process owning the element busy-waits if the lock is
not free, and an additional field Data that can be used to store application-dependent
information about the lock holder.

The next-record pointer is a double-sized variable: one half is the actual pointer
and the other half is a counter to avoid the A-B-A problem. The counter portion of
the pointer is itself divided into two parts: one bit, called the Dq bit, indicates
whether the queuing element is in the queue, and the remaining bits are
used as the actual counter. This technique is similar to the one used by Prakash et
al. [77] and Turek et al. [97]. Their counter refers to the record referenced by the
pointer. In our algorithm, the counter refers to the record that contains the pointer,
not the record that is pointed to.

If  the  Dq  bit  of  a  record  q  is  FALSE,  then  the  record  is  in  the  queue  for  a  lock 
L.  If  the  Dq  bit  is  TRUE,  then  the  record  is  probably  not  in  the  queue  (for  a  short 
period  of  time,  the  record  might  be  in  the  queue  with  its  Dq  bit  set  TRUE).  The  Dq 
bit  lets  the  PR-Lock  avoid  garbage  accesses. 

Each process keeps the address of its record in a local variable (Self). In addition,
each process requires two local pointer variables to hold the previous and the next
queue element for navigating the queue during the enqueue operation (Prev_Node
and Next_Node).

The  data  structures  used  are  shown  in  Figure  3.2.  The  Dq  bit  of  the  Pointer  field 
is  initialized  to  TRUE,  and  the  Ctr  field  is  initialized  to  0  before  the  record  is  first 
used. 
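
For readers who prefer concrete declarations, the record could be laid out in C roughly as below. This is a sketch, not the layout used in our implementation: the type names struct link and struct record and the helper record_init are hypothetical, and a bit-field struct like this is not by itself a unit that a hardware Compare&Swap can operate on (a real implementation would pack the fields into one double word).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct record;

/* Double-sized link: a pointer half, and a counter half whose one
   remaining bit is the Dq flag. */
struct link {
    struct record *ptr;
    unsigned int   ctr : 31;   /* modification counter */
    unsigned int   dq  : 1;    /* 1 (TRUE): record is (probably) not queued */
};

struct record {
    void       *data;      /* application-dependent information */
    bool        locked;    /* the owning process busy-waits on this flag */
    int         priority;
    struct link next;
};

/* Initialization as described in the text: Dq TRUE and Ctr 0 before the
   record is first used. */
static void record_init(struct record *r, int priority, void *data) {
    assert(r != NULL);
    r->data = data;
    r->locked = false;
    r->priority = priority;
    r->next.ptr = NULL;
    r->next.ctr = 0;
    r->next.dq = 1;
}
```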

A typical queue formed by the PR-Lock algorithm is shown in Figure 3.3.
Here L points to the record $q_0$ of the process currently holding the lock. The record $q_0$
has a pointer to the record $q_1$ of the process having the highest priority among
the processes waiting to acquire the lock L. Record $q_1$ points to record $q_2$ of the next
highest priority waiting process, and so on. The record $q_n$ belongs to the process with
the lowest priority among the waiting processes.

Acquire_Lock Operation

The acquire_lock operation is called by a process p before using the critical
section or resource guarded by lock L. The parameters of the acquire_lock operation
are the lock pointer L and the record q of the process (passed to local variable Self).

An acquire_lock operation searches for the correct position to insert q into the
queue, using Prev_Node and Next_Node to keep track of the current position. In
Figure 3.4, Prev_Node and Next_Node are abbreviated to P and N. The records pointed
to by P and N are $q_i$ and $q_{i+1}$, belonging to processes $p_i$ and $p_{i+1}$. Process p positions
itself so that $Pr(p_i) > Pr(p) > Pr(p_{i+1})$, where Pr is a function that maps a
process to its priority. Once such a position is found, q is prepared for insertion by
making q point to $q_{i+1}$. Then, the insertion is committed by making $q_i$ point to
q using the Compare&Swap instruction. The various stages and the final result are
shown in Figure 3.4.
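
The search, prepare, and commit steps described above can be sketched sequentially as follows. This single-threaded version is only an illustration: struct node and insert_by_priority are hypothetical names, the commit is a plain store where the concurrent algorithm uses Compare&Swap, and equal priorities are placed behind existing records so that they are served in FCFS order.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    int          priority;   /* larger value = higher priority (assumption) */
    struct node *next;
};

/* Insert q behind the head record (the lock holder, never displaced). */
static void insert_by_priority(struct node *head, struct node *q) {
    assert(head != NULL && q != NULL);
    struct node *prev = head;         /* P in Figure 3.4 */
    struct node *next = head->next;   /* N in Figure 3.4 */

    /* Advance while the record at N should stay ahead of q. */
    while (next != NULL && next->priority >= q->priority) {
        prev = next;
        next = next->next;
    }
    q->next = next;     /* prepare: q points to q_{i+1} */
    prev->next = q;     /* commit: q_i points to q */
}
```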


    structure Pointer {
        structure Object *Ptr;
        int31 Ctr;
        boolean Dq;
    }

    structure Record {
        structure structure_of_data Data;
        boolean Locked;
        integer Priority;
        structure Pointer Next;
    }

    Shared Variable
        structure Pointer L;

    Private Variables
        structure Pointer Self, Prev_Node, Next_Node;
        boolean Success, Failure;
        constant TRUE, FALSE, NULL, MAX_PRIORITY;

    Record Structure: Data | Locked | Priority | Next.Ptr | Next.Ctr | Next.Dq

Figure 3.2. Data Structures used in the PR-Lock Algorithm

Figure 3.3. Queue data structure used in PR-Lock algorithm

Figure 3.4. Stages in the acquire_lock operation

The acquire_lock algorithm is given in Figure 3.5. Before the acquire_lock
procedure is called, the Data and the Priority fields of the process' record are
initialized appropriately. In addition, the Dq bit of the Next pointer is implicitly
TRUE.

The acquire_lock operation begins by assuming that the lock is currently free
(the lock pointer L is null). It attempts to change L to point to its own record
with the Compare&Swap instruction. If the Compare&Swap is successful, the lock is
indeed free, so the process acquires the lock without busy-waiting. In the context of
the composite pointer structures that the algorithm uses, a NULL pointer is all zeros.

If the swap is unsuccessful, then the acquiring process traverses the queue to
position itself between a higher or equal priority process record and a lower priority
process record. Once such a junction is found, Prev_Node will point to the record
of the higher priority process and Next_Node will point to the record of the lower
priority process. The process first sets its link to Next_Node. Then, it attempts to
change the previous record's link to its own record by the atomic Compare&Swap.

If  successful,  the  process  sets  the  Dq  flag  in  its  record  to  FALSE  indicating  its 
presence  in  the  queue.  The  process  then  busy-waits  until  its  Locked  bit  is  set  to 
FALSE,  indicating  that  it  has  been  admitted  to  the  critical  section. 

There are three cases for an unsuccessful attempt at entering the queue. Problems
are detected by examining the value returned by the failed Compare&Swap, marked
F in the algorithm. Note that the returned value is in Next_Node. In addition,
a process might detect that it has misnavigated while searching the queue. When we
read Next_Node, the contents of the record pointed to by Prev_Node are fixed because
the record's counter is read into Next_Node.
    Procedure acquire_lock(L, Self) {
      Success=FALSE;
      do {
        Prev_Node=NULL; Next_Node=NULL;
        if(CAS(*L, &Next_Node, &Self)) {                  /* No Lock */
          Success=TRUE; Failure=FALSE;                    /* Use Lock */
          Self.Ptr->Next.Dq=FALSE;
        }
        else {                                            /* Lock in Use */
          Failure=FALSE; Self.Ptr->Locked=TRUE;
          do {
            Prev_Node=Next_Node;
            Next_Node=Prev_Node.Ptr->Next;
            if((Next_Node.Dq==TRUE)                       /* Deque, Try Again */  /* ii */
               or (Prev_Node.Ptr->Priority<Self.Ptr->Priority))                   /* iii */
              Failure=TRUE;
            else {
              if(Next_Node.Ptr==NULL or (Next_Node.Ptr!=NULL and
                 Next_Node.Ptr->Priority<Self.Ptr->Priority)) {
                Self.Ptr->Next.Ptr=Next_Node.Ptr;
                if(CAS(&(Prev_Node.Ptr->Next), &Next_Node, &Self)) {              /* F */
                  Self.Ptr->Next.Dq=FALSE;
                  while(Self.Ptr->Locked);                /* Busy Wait */
                  Success=TRUE;                           /* Then, use lock */
                }
                else {
                  if((Next_Node.Dq==TRUE)                 /* Deque, Try Again */  /* ii */
                     or (Prev_Node.Ptr->Priority<Self.Ptr->Priority))             /* iii */
                    Failure=TRUE;
                  else
                    Next_Node=Prev_Node;                                          /* i */
                }
              }
            }
          } while(!Success and !Failure);
        }
      } while(!Success);
    }

Figure 3.5. The acquire_lock operation procedure
1. A concurrent acquire_lock operation may overtake the acquire_lock operation
and insert its own record immediately after Prev_Node, as shown in
Figure 3.10. In this case the Compare&Swap will fail at the position marked F in
Figure 3.5. The correctness of this operation's position is not affected, so the
operation continues from its current position (line marked i in Figure 3.5).

2. A concurrent release_lock operation may overtake the acquire_lock operation
and remove the record pointed to by Prev_Node, as shown in Figure 3.11. In
this case, the Dq bit in the link pointer of this record will be TRUE. The
algorithm checks for this condition when it scans through the queue and when
it tries to commit its modifications. The algorithm detects the situation in
the two places marked ii in Figure 3.5. Every time a new record is
accessed (through Prev_Node), its link pointer is read into Next_Node and the Dq bit
is checked. In addition, if the Compare&Swap fails, the link pointer is saved in
Next_Node and the Dq bit is tested. If the Dq bit is TRUE, the algorithm starts
from the beginning.

3. A concurrent release_lock operation may overtake the acquire_lock operation
and remove the record pointed to by Prev_Node, and then the record is put back
into the queue, as shown in Figure 3.12. If the record returns with a priority
higher than or equal to Self's priority, then the position is still correct and
the operation can continue. Otherwise, the operation cannot find the correct
insertion point, so it has to start from the beginning. This condition is tested
at the lines marked iii in Figure 3.5.

The  spin-lock  busy  waiting  of  a  process  is  broken  by  the  eventual  release  of  the 
lock  by  the  process  which  is  immediately  ahead  of  the  waiting  process. 
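
The three interference cases reduce to one small decision, sketched below in C. The enum and the function after_interference are hypothetical names introduced for illustration; the two tests mirror the lines marked ii and iii in Figure 3.5, and the fall-through mirrors the line marked i.

```c
#include <assert.h>
#include <stdbool.h>

enum action { CONTINUE_FROM_PREV, RESTART_FROM_HEAD };

/* prev_dq:       Dq bit read from the link of the record at Prev_Node
   prev_priority: priority currently stored in that record
   self_priority: priority of the acquiring process */
static enum action after_interference(bool prev_dq, int prev_priority,
                                      int self_priority) {
    if (prev_dq)                         /* case 2: record was dequeued (ii) */
        return RESTART_FROM_HEAD;
    if (prev_priority < self_priority)   /* case 3: requeued too low (iii) */
        return RESTART_FROM_HEAD;
    return CONTINUE_FROM_PREV;           /* case 1: position still valid (i) */
}
```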
    Procedure release_lock(L, Self) {
        Self.Dq=TRUE;
        L=Self.Ptr->Next;                         /* Release Lock */
        if(Self.Ptr->Next!=NULL) {
            L.Ptr->Priority=MAX_PRIORITY;
            L.Ptr->Locked=FALSE;
        }
    }

Figure 3.6. The release_lock operation procedure

Release_Lock Operation

The release_lock operation is straightforward; the algorithm is given in
Figure 3.6. The process p releasing the lock sets the Dq bit in its record's link
pointer to TRUE, indicating that the record is no longer in the queue. Setting the
Dq bit prevents any acquire_lock operation from modifying the link. The releasing
process copies the address of the successor record, if any, to L. The process then
releases the lock by setting the Locked boolean variable in the record of the next
waiting process to FALSE. To avoid testing special cases in the acquire_lock
operation, the priority of the head record is set to the highest possible priority.
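
The release steps can be traced in a small single-threaded C sketch. It is an illustration under simplifying assumptions: struct rec flattens the composite pointer into separate dq and next fields, release_sketch is a hypothetical name, and MAX_PRIORITY is taken to be INT_MAX.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PRIORITY INT_MAX   /* assumed encoding of the highest priority */

struct rec {
    int         priority;
    bool        locked;   /* waiter spins while this is true */
    bool        dq;       /* true once the record has left the queue */
    struct rec *next;
};

static void release_sketch(struct rec **lock, struct rec *self) {
    assert(lock != NULL && *lock == self);   /* only the holder releases */
    self->dq = true;                 /* decisive instruction: mark dequeued */
    *lock = self->next;              /* successor, if any, becomes the head */
    if (self->next != NULL) {
        (*lock)->priority = MAX_PRIORITY;   /* head is never overtaken */
        (*lock)->locked = false;            /* admit the successor */
    }
}
```

Note that the amount of work does not depend on the number of waiting processes, which is the source of the low and uniform release overhead.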

3.3    Correctness of PR-Lock Algorithm

In this section, we present an informal argument for the correctness properties
of our PR-Lock algorithm. We prove that the PR-Lock algorithm is correct by
showing that it maintains a priority queue, and that the head of the priority queue is
the process that holds the lock. The PR-Lock is decisive-instruction serializable [83].
Both operations of the PR-Lock algorithm have a single decisive instruction. The
decisive instruction for the acquire_lock operation is the successful Compare&Swap
and the decisive instruction for the release_lock operation is setting the Dq bit.
Corresponding to a concurrent execution C of the queue operations, there is an
equivalent (with respect to return values and final states) serial execution $S_d$ such
that if operation $O_1$ executes its decisive instruction before operation $O_2$ does in C,
then $O_1 < O_2$ in $S_d$. Thus, the equivalent priority queue of a PR-Lock is in a single
state at any instant, simplifying the correctness proof (a concurrent data structure
that is linearizable but not decisive-instruction serializable might be in several states
simultaneously [37]).

We use the following notation in our discussion. PR-Lock $\mathcal{L}$ has lock pointer L,
which points to the first record in the lock queue (and the record of the process that
holds the lock). Let there be N processes $p_1, p_2, \ldots, p_N$ that participate in the lock
synchronization for a priority lock $\mathcal{L}$, using the PR-Lock algorithm. As mentioned
earlier, each process $p_i$ allocates a record to enqueue and dequeue. Thus, each
process $p_i$ participating in the lock access is associated with a queue record $q_i$. Let
$Pr(p_i)$ be a function which maps a process to its priority, a number between 1 and N.
We also define another function $Pr(q_i)$ which maps a record belonging to a process
$p_i$ to its priority.

A priority queue is an abstract data type that consists of

•  A finite set Q of elements $q_i$, $i = 1, \ldots, N$

•  A function $Pr : q_i \to n_i$, where $n_i \in \mathcal{N}$. For simplicity, we assume that every $n_i$
is unique. This assumption is not required for correctness, and in fact processes
of the same priority will obtain the lock in FCFS order.

•  Two operations, enqueue and dequeue


At any instant, the state of the queue can be defined as

$$Q = (q_0, q_1, q_2, \ldots, q_n)$$

where $q_1 <_Q q_2 <_Q \cdots <_Q q_n$, and $q_i <_Q q_j$ iff $Pr(q_i) > Pr(q_j)$.

We  call  q0  the  head  record  of  priority  queue  Q.  The  head  record's  process  is  the 
current  lock  holder.  Note  that  the  non-head  records  are  totally  ordered. 

The enqueue operation is defined as

$$\mathrm{enqueue}((q_0, q_1, q_2, \ldots, q_n), q) \to (q_0, q_1, q_2, \ldots, q_i, q, q_{i+1}, \ldots, q_n)$$

where $Pr(q_i) > Pr(q) > Pr(q_{i+1})$.

The dequeue operation on a non-empty queue is defined as

$$\mathrm{dequeue}((q_0, q_1, q_2, \ldots, q_n)) \to (q_1, q_2, \ldots, q_n)$$

where the return value is $q_0$. A dequeue operation on an empty queue is undefined.
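
As a reference model for these definitions, the abstract queue can be written as a small array-based C sketch. The names struct pq, pq_enqueue, and pq_dequeue and the capacity QCAP are hypothetical; only priorities are stored, position 0 is the head record, and the remaining positions are kept in decreasing priority order.

```c
#include <assert.h>
#include <stddef.h>

#define QCAP 16   /* arbitrary capacity for the sketch */

struct pq {
    int pri[QCAP];   /* pri[0] is the head (lock holder) */
    int n;           /* number of records in the queue */
};

/* enqueue: the head is never displaced; q goes between q_i and q_{i+1}. */
static void pq_enqueue(struct pq *q, int p) {
    assert(q != NULL && q->n < QCAP);
    if (q->n == 0) {             /* enqueue((), q) -> (q) */
        q->pri[0] = p;
        q->n = 1;
        return;
    }
    int i = q->n;
    while (i > 1 && q->pri[i - 1] < p) {   /* shift lower priorities back */
        q->pri[i] = q->pri[i - 1];
        i--;
    }
    q->pri[i] = p;
    q->n++;
}

/* dequeue: removes and returns the head; undefined on an empty queue. */
static int pq_dequeue(struct pq *q) {
    assert(q != NULL && q->n > 0);
    int head = q->pri[0];
    for (int i = 1; i < q->n; i++)
        q->pri[i - 1] = q->pri[i];
    q->n--;
    return head;
}
```

A model of this kind is convenient as an oracle when testing a concurrent implementation: replaying the decisive instructions in order against the model should reproduce the observed queue.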

For every PR-Lock $\mathcal{L}$, there is an abstract priority queue Q. Initially, both $\mathcal{L}$ and
Q are empty. When a process p with a record q performs the decisive instruction
for the acquire_lock operation, Q changes state to enqueue(Q, q). Similarly, when
a process executes the decisive instruction for a release_lock operation, Q changes
state to dequeue(Q).

We show that when we observe $\mathcal{L}$, we find a structure that is equivalent to Q.
To observe $\mathcal{L}$, we take a consistent snapshot [15] of the current state of the system
memory. Next, we start at the lock pointer L and observe the records following
the linked list. If the head record has its Dq bit set and its process has exited the
acquire_lock operation, then we discard it from our observation. If we observe the
same records in the same sequence in both $\mathcal{L}$ and Q, then we say that $\mathcal{L}$ and Q are
equivalent, and we write $\mathcal{L} \Leftrightarrow Q$.


[Figure 3.7: before the release_lock, L points to the list $q_0, q_1, \ldots, q_n$; after it, L points to $q_1, q_2, \ldots, q_n$.]

Figure 3.7. Observed queue $\mathcal{L}$ before and after a release_lock

Theorem 1 The representative priority queue Q is equivalent to the observed queue
of the PR-Lock $\mathcal{L}$.

Proof. We prove the theorem by induction on the decisive instructions, using the
following two lemmas.

Lemma 1 If $Q \Leftrightarrow \mathcal{L}$ before a release_lock decisive instruction, then $Q \Leftrightarrow \mathcal{L}$ after
the release_lock decisive instruction.

Proof. Let $Q = (q_0, q_1, q_2, \ldots, q_n)$ before a release_lock decisive instruction. A
release_lock operation is equivalent to a dequeue operation on the abstract queue.
By definition,

$$\mathrm{dequeue}((q_0, q_1, q_2, \ldots, q_n)) \to (q_1, q_2, \ldots, q_n)$$

The before and after states of $\mathcal{L}$ are shown in Figure 3.7. If L points to the
record $q_0$ before the release_lock decisive instruction, the release_lock decisive
instruction sets the Dq bit in $q_0$ to TRUE, removing $q_0$ from the observable queue.
Thus, $Q \Leftrightarrow \mathcal{L}$ after the release_lock operation. Note that L will point to $q_1$ before
the next release_lock decisive instruction. □


Lemma 2 If $Q \Leftrightarrow \mathcal{L}$ before an acquire_lock decisive instruction, then $Q \Leftrightarrow \mathcal{L}$ after
the acquire_lock decisive instruction.

Figure 3.8. Observed queue $\mathcal{L}$ before and after an acquire_lock to an empty queue

Figure 3.9. Observed queue $\mathcal{L}$ before and after an acquire_lock to a non-empty
queue

Proof. There are two different cases to consider:

Case 1: $Q = ()$ before the acquire_lock decisive instruction. The equivalent
operation on the abstract queue Q is the enqueue operation. Thus,

$$\mathrm{enqueue}((), q) \to (q)$$

If the lock $\mathcal{L}$ is empty, q's process executes a successful decisive Compare&Swap
instruction to make L point to q and acquires the lock (Figure 3.8).

Clearly, $Q \Leftrightarrow \mathcal{L}$ after the acquire_lock decisive instruction.

Case 2: $Q = (q_0, q_1, q_2, \ldots, q_n)$ before the acquire_lock decisive instruction. The
state of the queue Q after the acquire_lock is given by

$$\mathrm{enqueue}((q_0, q_1, q_2, \ldots, q_n), q) \to (q_0, q_1, q_2, \ldots, q_i, q, q_{i+1}, \ldots, q_n)$$

The corresponding $\mathcal{L}$ before and after the acquire_lock is shown in Figure 3.9.
The pointers P and N are the Prev_Node and Next_Node pointers by which q's
acquire_lock operation positions its record such that the process observes
$Pr(q_i) > Pr(q) > Pr(q_{i+1})$. Then, the Next pointer in q is set to the address of $q_{i+1}$.
The Compare&Swap instruction, marked F in Figure 3.5, attempts to make the
Next pointer in $q_i$ point to q. If the Compare&Swap instruction succeeds, then it
is the decisive instruction of q's process and the resulting queue $\mathcal{L}$ is illustrated in
Figure 3.9. This is equivalent to Q after the enqueue operation. If the
Compare&Swap succeeds only when $q_i$ is in the queue, $q_{i+1}$ is the successor record, and
$Pr(q_i) > Pr(q) > Pr(q_{i+1})$, then $Q \Leftrightarrow \mathcal{L}$.

If there are no concurrent operations on the queue, we can observe that P
and N are positioned correctly and the Compare&Swap succeeds. If there are other
concurrent operations, they can interfere with the execution of an acquire_lock
operation, A. There are three possibilities:

Case a: Another acquire_lock A' enqueued its record q' between $q_i$ and $q_{i+1}$, but
$q_i$ has not yet been dequeued. If $Pr(q_i) > Pr(q) > Pr(q_{i+1})$, then q's process will
attempt to insert q between $q_i$ and $q_{i+1}$. Process A' has modified $q_i$'s next pointer,
so q's Compare&Swap will fail. Since $q_i$ has not been dequeued, $Pr(q_i) >
Pr(q)$, and q's process should continue its search from $q_i$, which is what happens. If
$Pr(q_{i+1}) > Pr(q)$, then q's process can skip over q' and continue searching from $q_{i+1}$,
which is what happens. This scenario is illustrated in Figure 3.10.

Case b: A release_lock operation R overtakes A and removes $q_i$ from the queue
(i.e., R has set $q_i$'s Dq bit), and $q_i$ has not yet been returned to the queue (its Dq bit
is still set). Since $q_i$ is not in the lock queue, A is lost and must start searching
again. Based on its observations of $q_i$ and $q_{i+1}$, A may have decided to continue
searching the queue or to commit its operation. In either case A sees the Dq bit
set and fails, so A starts again from the beginning of the queue. This scenario is
illustrated in Figure 3.11.

Figure 3.10. A concurrent acquire_lock A' succeeds before A

Case c: A release_lock operation R overtakes A and removes $q_i$ from the queue,
and then $q_i$ is put back in the queue by another acquire_lock A'. If A tries to commit
its operation, then the pointer in $q_i$ has changed, so the Compare&Swap fails. Note
that even if $q_i$ is pointing to $q_{i+1}$, the version numbers prevent the decisive instruction
from succeeding. If A continues searching, then there are two possibilities based on
the new value of $Pr(q_i)$. If $Pr(q) > Pr(q_i)$, A is lost and cannot find the correct
place to insert q. This condition is detected when the priority of $q_i$ is examined (the
lines marked iii in Figure 3.5), and operation A restarts from the head of the queue.
If $Pr(q) < Pr(q_i)$, then A can still find a correct place to insert q past $q_i$, and A
continues searching. This scenario is illustrated in Figure 3.12.

No matter what interference occurs, A always takes the right action. Therefore,
$Q \Leftrightarrow \mathcal{L}$ after the acquire_lock decisive instruction. □

Figure 3.11. A concurrent release_lock R succeeds before A

To prove the theorem we use induction. Initially, $Q = ()$ and L points to nil.
So, $Q \Leftrightarrow \mathcal{L}$ is trivially true. Suppose that the theorem is true before the ith decisive
instruction. If the ith decisive instruction is for an acquire_lock operation, Lemma 2
implies $Q \Leftrightarrow \mathcal{L}$ after the ith decisive instruction. If the ith decisive instruction is for
a release_lock operation, Lemma 1 implies $Q \Leftrightarrow \mathcal{L}$ after the ith decisive instruction.
Therefore, the inductive step holds, and hence $Q \Leftrightarrow \mathcal{L}$. □

3.4  Extensions 

In this section we discuss two simple extensions that increase the utility
of the PR-Lock algorithm.

3.4.1    Multiple  Locks 

As described, a record for a PR-Lock can be used for one lock queue only (otherwise,
a process might obtain a lock other than the one it desired). If the real-time
system has several critical sections, each with its own lock (which is likely), each
process must have a lock record for each lock queue, which wastes space.

Figure 3.12. Release_lock R and acquire_lock A' succeed before A

Fortunately, a simple extension of the PR-Lock algorithm allows a lock record to
be used in many different lock queues. We replace the Dq bit by a Dq string of $l$ bits.
If the Dq string evaluates to $i > 0$ when interpreted as a binary number, then the
record is in the queue for lock i. If the Dq string evaluates to 0, then the record is
(probably) not in any queue. The acquire_lock and release_lock algorithms carry
through by modifying the test for being or not being in queue i appropriately.
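
The modified membership test can be sketched with a few bit operations. The width of four bits and the names set_queue_id, queue_id, and in_queue are assumptions made for this example only; the Dq string is taken to occupy the low bits of the counter word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DQ_BITS 4u                        /* assumed width of the Dq string */
#define DQ_MASK ((1u << DQ_BITS) - 1u)

/* Store lock number lock_id (0 = not in any queue) in the Dq string,
   leaving the counter bits of the word unchanged. */
static inline uint32_t set_queue_id(uint32_t ctr_word, uint32_t lock_id) {
    assert(lock_id <= DQ_MASK);
    return (ctr_word & ~DQ_MASK) | lock_id;
}

static inline uint32_t queue_id(uint32_t ctr_word) {
    return ctr_word & DQ_MASK;
}

/* Replacement for the single-bit test "Dq == FALSE" on lock lock_id. */
static inline bool in_queue(uint32_t ctr_word, uint32_t lock_id) {
    return lock_id != 0u && queue_id(ctr_word) == lock_id;
}
```

With four bits a record can be shared among up to 15 locks, at the cost of four counter bits.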

We  note  that  if  a  process  sets  nested  locks,  a  new  lock  record  must  be  used  for 
each  level  of  nesting.  Craig  [19]  presents  a  method  for  reusing  the  same  record  for 
nested  locks. 

3.4.2    Backing  Out 

If  a  process  does  not  obtain  the  lock  after  a  certain  deadline,  it  might  wish  to 
stop  waiting  and  continue  processing.  The  process  must  first  remove  its  record  from 
the  lock  queue.  To  do  so,  the  process  follows  these  steps: 
1. Find the preceding record in the lock queue, using the method from the
algorithm for the acquire_lock operation. If the process determines that its record
is at the head of the lock queue, return with a "lock obtained" value.

2. Set the Dq bit (Dq string) of the process' record to "Dequeued".

3. Perform a Compare&Swap of the predecessor record's next pointer with
the process' next pointer. If the Compare&Swap fails, go to 1. If the
Compare&Swap succeeds, return with a "lock released" value.

Step 2 fixes the value of the process's successor. If the process removes itself from
the queue without obtaining the lock, the Compare&Swap is the decisive instruction.
If the Compare&Swap fails, the predecessor might have released the lock, or a third
process might have enqueued itself as the predecessor. The process cannot distinguish
between these possibilities, so it must search the lock queue again.
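
The three steps can be followed in the single-threaded C sketch below. It is a simplification under stated assumptions: struct wrec, enum backout, and back_out are hypothetical names, the record is assumed to be in the queue, and the unlink is a plain store, so the retry path taken when the Compare&Swap fails does not appear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct wrec {
    bool         dq;     /* true = "Dequeued" */
    struct wrec *next;
};

enum backout { LOCK_OBTAINED, LOCK_RELEASED };

static enum backout back_out(struct wrec *head, struct wrec *self) {
    assert(head != NULL && self != NULL);

    /* Step 1: locate the predecessor; at the head, the lock is ours. */
    if (head == self)
        return LOCK_OBTAINED;
    struct wrec *pred = head;
    while (pred->next != self) {
        assert(pred->next != NULL);   /* self must be in the queue */
        pred = pred->next;
    }

    /* Step 2: mark the record dequeued, which fixes its successor. */
    self->dq = true;

    /* Step 3: unlink (Compare&Swap in the concurrent algorithm). */
    pred->next = self->next;
    return LOCK_RELEASED;
}
```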

3.5    Simulation  Results 

We  simulated  the  execution  of  the  PR-Lock  algorithm  in  PROTEUS,  which  is 
a  configurable  multiprocessor  simulator  [12].  We  also  implemented  the  MCS-Lock 
and  Markatos'  lock  to  demonstrate  the  difference  in  the  acquisition  and  release  time 
characteristics. 

In the simulation, we use a multiprocessor model with eight processors and a
global shared memory. Each processor has a local cache memory of 2048 bytes.
In PROTEUS, the units of execution time are cycles. Each process executes for a
uniformly distributed random time, in the range 1 to 35 cycles, before it issues
an acquire_lock request. After acquiring the lock, the process stays in the critical
section for a fixed number of cycles (150) plus another uniformly distributed random
number (1 to 400) of cycles before releasing the lock. This procedure is repeated fifty
times. The average number of cycles taken to acquire a lock by a process is then
computed. PROTEUS simulates parallelism by repeatedly executing a processor's
program for a time quantum, Q. In our simulations, Q = 10. The priority of a
process is set equal to the process/processor number; the lower the number, the
higher the priority of a process.
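
The workload parameters above can be reproduced outside PROTEUS with a few lines of C. This sketch is not the simulator code: the helper names are hypothetical, and the standard rand() function is used as a stand-in random source (the modulo reduction introduces a slight bias that is ignored here).

```c
#include <assert.h>
#include <stdlib.h>

/* Uniform integer in [lo, hi], using rand() as an assumed random source. */
static int uniform_int(int lo, int hi) {
    assert(lo <= hi);
    return lo + rand() % (hi - lo + 1);
}

/* Cycles executed before issuing an acquire_lock request: 1 to 35. */
static int think_cycles(void) {
    return uniform_int(1, 35);
}

/* Cycles spent in the critical section: 150 fixed plus 1 to 400. */
static int critical_section_cycles(void) {
    return 150 + uniform_int(1, 400);
}
```

Under these parameters the expected critical section length is 150 + 200.5 = 350.5 cycles, roughly twenty times the expected 18 cycles between requests, so the lock is almost always contended.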

Figure  3.13  shows  the  average  time  taken  for  a  process  to  acquire  a  lock  using 
the  MCS-Lock  algorithm,  the  PR-Lock  algorithm,  and  the  Markatos'  lock  algorithm. 
A process using the MCS-Lock algorithm has to wait in the FIFO queue for all other
processes in every round. However, a process using the PR-Lock algorithm will
wait for a time that is proportional to the number of higher priority processes. As
an example, the highest and second highest priority processes each wait, on average, for
about one critical section period. We note that the two highest priority processes have
about  the  same  acquire  lock  execution  time  because  they  alternate  in  acquiring  the 
lock.  Only  after  both  of  these  processes  have  completed  their  execution  can  the  third 
and  fourth  highest  priority  processes  obtain  the  lock.  Figure  3.13  clearly  demonstrates 
that  the  average  acquisition  time  for  a  lock  using  PR-Lock  is  proportional  to  the 
process  priorities,  whereas  the  average  acquisition  time  is  proportional  to  the  number 
of  processes  in  case  of  the  MCS-Lock  algorithm.  This  feature  makes  the  PR-Lock 
algorithm  attractive  for  use  in  real-time  systems. 

The same prioritized lock-acquisition behavior is shown using Markatos'
algorithm, but the average time to acquire a lock is 50% greater than when the PR-Lock
is used. At first this result is puzzling, because Markatos' lock performs the majority
of its work when the lock is released and the PR-Lock performs its work when the
lock is acquired. However, the time to release a lock is part of the time spent in the
critical section, and the time to acquire a lock depends primarily on time spent in
the critical section by the preceding lock holders. Thus, the PR-Lock allows much
faster access to the critical section. As we will see, the PR-Lock also allows more
predictable access to the critical section.

Finally,  we  compared  the  time  required  to  release  a  lock  using  both  the  PR- 
Lock  and  Markatos'  lock.  These  results  are  shown  in  Figure  3.14.  The  time  to 
release  a  lock  using  PR-Lock  is  small,  and  is  consistent  for  all  of  the  processes. 
Releasing  a  lock  using  Markatos'  lock  requires  significantly  more  time.  Furthermore, 
in  our  experiments  a  high  priority  process  is  required  to  spend  significantly  more  time 
releasing  a  lock  than  is  required  for  a  low  priority  process.  This  behavior  is  a  result 
of  the  way  that  the  simulation  was  run.  When  high  priority  processes  are  executing, 
all  low  priority  processes  are  blocked  in  the  queue.  As  a  result,  many  records  must  be 
searched  when  a  high  priority  process  releases  a  lock.  Thus,  a  high  priority  process 
does  work  on  behalf  of  low  priority  processes.  The  time  required  for  a  high  priority 
process  to  release  its  lock  depends  on  the  number  of  blocked  processes  in  the  queue. 
The  result  is  a  long  and  unpredictable  amount  of  time  required  to  release  a  lock. 
Since  the  lock  must  be  released  before  the  next  process  can  acquire  the  lock,  the  time 
required  to  acquire  a  lock  is  also  made  long  and  unpredictable. 

Most  of  the  time  the  cache-hit  ratio  is  95%  or  higher  on  each  of  the  processors 
using  the  PR-Lock  algorithm,  and  we  found  an  average  cache  hit  range  of  99.72%  to 
99.87%.  Thus,  the  PR-Lock  generates  very  little  network  or  memory  contention  in 
spite  of  the  processes  using  busy-waiting. 

3.6  Conclusion 

In  this  chapter,  we  present  a  priority  spin-lock  synchronization  algorithm,  the  PR- 
Lock,  which  is  suitable  for  real-time  shared-memory  multiprocessors.  The  PR-Lock 
algorithm  is  characterized  by  a  prioritized  lock  acquisition,  a  low  release  overhead, 
very  little  bus-contention,  and  well-defined  semantics.  Simulation  results  show  that 
the PR-Lock algorithm performs well in practice. This priority lock algorithm can be
used  as  presented  for  mutually  exclusive  access  to  a  critical  section  or  can  be  used  to 
provide  higher  level  synchronization  constructs  such  as  prioritized  semaphores  and 
monitors.  The  PR-Lock  maintains  a  pointer  to  the  record  of  the  lock  holder,  so 
the  PR-Lock  can  be  used  to  implement  priority  inheritance  protocols.  Finally,  the 
PR-Lock  algorithm  can  be  adapted  for  use  as  a  single-dequeuer,  multiple-enqueuer 
parallel  priority  queue. 

While  several  prioritized  spin-locks  have  been  proposed,  the  PR-Lock  has  the 
following  advantages. 

•  The  algorithm  is  contention  free. 

•  A  higher  priority  process  does  not  have  to  work  for  a  lower  priority  process 
while  releasing  a  lock.  As  a  result,  the  time  required  to  acquire  and  release  a 
lock  is  fast  and  predictable. 

•  The PR-Lock has a well-defined acquire-lock point.

•  The PR-Lock maintains a pointer to the process using the lock, which facilitates
implementing priority inheritance protocols.

For  future  work,  we  are  interested  in  prioritizing  access  to  other  operating  system 
structures  to  make  them  more  appropriate  for  use  in  a  real-time  parallel  operating 
system. 


Figure 3.13. Comparison of lock acquisition times (legend: MCS-Lock, PR-Lock)

Figure 3.14. Comparison of lock release times (legend: MCS-Lock, PR-Lock)


CHAPTER  4 
INTERRUPTIBLE  CRITICAL  SECTIONS 


4.1  Introduction 

The scheduling of independent real-time tasks is well understood, as optimal
scheduling algorithms have been proposed for periodic and aperiodic tasks on
uniprocessor [21, 59] and multiprocessor systems [20, 60, 70]. However, if the tasks
communicate through shared critical sections, a low-priority task that holds a lock may
block a high-priority task that requires the lock, causing a priority inversion. In
this chapter, we present a method for real-time synchronization that avoids priority
inversions.

We present a new approach to synchronization on uniprocessors with special
applicability to embedded and real-time systems. Existing methods for synchronization
in real-time systems are pessimistic, and use blocking to enforce concurrency
control. While protocols to bound the blocking of high priority tasks exist, high priority
tasks can still be blocked by low priority tasks. In addition, these protocols require
a complex interaction with the scheduler. We propose interruptible critical sections
(i.e., optimistic synchronization) as an alternative to purely blocking methods.
Practical optimistic synchronization requires techniques for writing interruptible critical
sections, and system support for detecting critical section access conflicts. We
discuss our implementation of an interruptible lock on a system running the pSOS+
real-time operating system. Our experimental performance results show that
interruptible locks reduce the variance in the response time of the highest priority task
with only a small impact on the performance of the low priority tasks. We show how
interruptible critical sections can be combined with the Priority Ceiling Protocol, and
present an analysis which shows that interruptible locks improve the schedulability
of task sets that have high priority tasks with tight deadlines.

Rajkumar, Sha, and Lehoczky [80] have proposed the Priority Ceiling Protocol
(PCP) to minimize the effect of priority inversion. The priority ceiling of a
semaphore S is the priority of the highest priority task that will ever lock S. A task
may lock a semaphore only if its priority is higher than the priority ceilings of all
locked semaphores (except for the semaphores that it has locked). The PCP
guarantees that a task will be blocked by a lower priority task at most once during its
execution. However, the tasks must have static priorities in order to apply the
Priority Ceiling Protocol. In addition, blocking for even the duration of one critical
section may be excessive. Rajkumar, Sha, and Lehoczky have extended the Priority
Ceiling Protocol to work in a multiprocessor system [79].
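
As an illustration of the locking rule stated above, the following C sketch checks whether a task may lock a semaphore under the PCP. It is a minimal sketch rather than a full implementation of the protocol: the structure pcp_sem, the function pcp_may_lock, the constant NO_HOLDER, and the convention that larger numbers mean higher priority are all assumptions made for this example, and neither the blocking itself nor priority inheritance is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_HOLDER (-1)   /* hypothetical marker for an unlocked semaphore */

/* Hypothetical bookkeeping for one semaphore; larger numbers mean higher priority. */
struct pcp_sem {
    int ceiling;   /* priority of the highest priority task that will ever lock it */
    int holder;    /* task id of the current holder, or NO_HOLDER */
};

/* Returns true if task `task`, running at priority `prio`, may lock a semaphore now. */
bool pcp_may_lock(int task, int prio, const struct pcp_sem *sems, size_t n)
{
    assert(sems != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (sems[i].holder == NO_HOLDER || sems[i].holder == task)
            continue;              /* unlocked, or locked by the caller itself */
        if (prio <= sems[i].ceiling)
            return false;          /* not strictly above a locked ceiling */
    }
    return true;
}
```

With one semaphore of ceiling 5 held by task 1, a request from another task at priority 4 is refused, while a request at priority 6 is granted.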

Blocking-based synchronization algorithms have been extended to work with
dynamic priority schedulers. Baker [8] presents a pre-allocation based synchronization
algorithm that can manage resources with multiple instances. A task's execution is
delayed until the scheduler can guarantee that the task can execute without blocking
a higher priority task. Tripathi and Nirkhe [95], and Faulk and Parnas [24] also
discuss pre-allocation based scheduling methods. Chen and Lin [16] extend the Priority
Ceiling Protocol to permit dynamically-assigned priorities. Chen and Lin [17] extend
the protocol in [16] to account for multiple resource instances.

Previous  approaches  to  real-time  synchronization  suffer  from  several  drawbacks. 
First,  a  high-priority  task  might  be  forced  to  wait  for  a  low-priority  task  to  complete  a 
critical  section.  Mercer  and  Tokuda  [68]  note  that  the  blocking  of  high-priority  tasks 
must  be  kept  to  a  minimum  in  order  to  ensure  the  responsiveness  of  the  real-time 
system. If tasks can have delayed release times [57], a high priority task might not be
able to block for the duration of a critical section and still be guaranteed to meet its
deadlines.  Jeffay  [43]  discusses  the  additional  feasibility  conditions  required  if  tasks 
have  preemption  constraints.  Second,  dynamic-priority  scheduling  algorithms  are 
feasible  with  much  higher  CPU  utilizations  than  static-priority  scheduling  algorithms 
[59],  and  dynamic-priority  schedulers  might  be  required  for  aperiodic  tasks.  The 
simple  Priority  Ceiling  Protocol  of  Rajkumar,  Sha,  and  Lehoczky  [80]  can  be  applied 
to  static-priority  schedulers  only.  The  dynamic-priority  synchronization  protocols 
[8,  16,  17]  are  complex,  and  must  be  closely  integrated  with  the  scheduling  algorithm. 

We present a different approach to synchronization, one which guarantees that
a high-priority task never waits for a low-priority task at a critical section. We
introduce the idea of an Interruptible Critical Section (ICS), which is a critical section
protected by optimistic concurrency control instead of by blocking. A task calculates
its modifications to the shared data structure, then attempts to commit its
modification. If a higher priority task previously committed a conflicting modification, the
lower priority task fails to commit, and must try again (as in optimistic concurrency
control [9]). Otherwise, the task succeeds, and continues in its work. The
synchronization algorithms are not tied to the scheduling algorithm, simplifying the design
of the real-time operating system.
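
The compute-then-commit cycle described above can be sketched in C as follows. This is only an illustration of the optimistic pattern, not the mechanism developed later in the chapter: the type shared_counter, the functions ics_begin and ics_commit, and the use of a version number to detect conflicts are assumptions of this example. The sketch also assumes that ics_commit runs without preemption, a guarantee that in practice has to come from system support.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared object: `version` counts the committed updates. */
struct shared_counter {
    unsigned version;
    int value;
};

/* Start of an attempt: remember which version the computation is based on. */
unsigned ics_begin(const struct shared_counter *c)
{
    assert(c != NULL);
    return c->version;
}

/*
 * End of an attempt. Assumed to run without preemption (the sketch does not
 * model the system support that would enforce this). The commit fails if
 * another task has committed since the matching ics_begin().
 */
bool ics_commit(struct shared_counter *c, unsigned seen, int new_value)
{
    assert(c != NULL);
    if (c->version != seen)
        return false;      /* conflict: the caller must recompute and retry */
    c->value = new_value;
    c->version++;
    return true;
}
```

A low priority task that calls ics_begin, is preempted by a high priority task that commits, and then calls ics_commit sees a failure and repeats its computation; the high priority task never waits.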

A  purely  optimistic  approach  to  synchronization  can  starve  low  priority  tasks, 
leading to poor performance (i.e., low schedulability). We show how to combine
ICS  with  locking,  to  create  interruptible  locks.  Interruptible  locks  can  be  used  in 
conjunction  with  the  PCP  to  provide  schedulability  guarantees  for  the  low  priority 
tasks.  We  present  an  analysis  of  periodic  tasks  that  use  interruptible  locks  with  the 
Priority  Ceiling  Protocol. 

We present our implementation of ICS and interruptible locks on the pSOS+
real-time operating system, and show that we can reduce the maximum response time of
a high priority task. Our implementation of interruptible locks is realized through
a small amount of code, and did not require a modification of the pSOS+ kernel
(although it did make use of a kernel call-out routine). We note that pSOS+ does
not provide priority inheritance.

Interruptible  critical  sections  are  best  applied  in  embedded  or  real-time  operating 
systems  to  improve  the  schedulability  of  the  highest  priority  tasks.  An  operating 
system  for  embedded  systems  will  of  necessity  provide  the  flexibility  required  to 
implement  an  ICS  (as  does  pSOS+).  In  such  an  environment,  high  priority  tasks  can 
enter  an  ICS  without  making  a  system  call,  thus  avoiding  the  associated  overhead. 
An ICS cannot reserve resources for a process, although it can co-exist with
blocking algorithms [8, 17, 80] that can. An ICS can, however, be used to communicate with
a high-priority device driver. Low priority tasks submit requests to the device driver
through the ICS, and the device is serviced by a high priority driver which obtains
commands through the ICS. In Section 4.8, we provide examples of task sets that
cannot be guaranteed to meet their deadlines using the Priority Ceiling Protocol, but
are feasible if interruptible locks are used.

4.2    Interruptible  Critical  Sections 

We build our optimistic synchronization methods on Restartable Atomic Sequences
(RAS) [11]. A RAS is a section of code that is re-executed from the beginning if a
context switch occurs while a process is executing in the code section. The re-execution
of a RAS is enforced by the kernel context-switch mechanism. If the kernel detects
that the process program counter is within a RAS on a context switch, the kernel
sets the program counter to the start of the RAS. Bershad et al. show that a RAS
implementation of an atomic test-and-set has better performance than a hardware
test-and-set on many architectures, and is much faster than kernel-level
synchronization [11].

We  note  that  the  idea  of  scheduler  support  for  critical  sections  is  well  established. 
In  4.3BSD  UNIX,  a  system  call  that  is  interrupted  by  a  signal  is  restarted  using 
the  long  jump  instruction  [56].  Anderson  et  al.  [5]  argue  that  the  operating  system 
support  for  parallel  threads  should  recognize  that  a  preempted  thread  is  executing  in 
a  critical  section,  and  execute  the  preempted  thread  until  the  thread  exits  the  critical 
section.  In  addition,  Moss  and  Kohler  coded  several  of  the  run-time  support  calls  of 
the  Trellis/Owl  language  so  that  they  could  be  restarted  if  interrupted  [73]. 

The  simple  mechanism  described  in  [11]  is  too  crude  for  our  purposes,  because 
there  is  no  guarantee  that  a  conflicting  operation  was  performed  when  other  processes 
had  control  of  the  CPU.  The  unnecessary  re-executions  are  not  a  problem  for  the 
critical  sections  described  in  [11],  because  those  critical  sections  are  very  short  and 
a  re-execution  is  unlikely.  In  addition,  the  authors  of  [11]  did  not  need  to  consider 
the  predictability  required  by  real-time  systems.  If  the  critical  section  execution 
occupies  a  large  fraction  of  a  time  slice,  then  a  context  switch  is  far  more  likely.  To 
guarantee  progress,  a  process  that  is  interrupted  in  its  critical  section  execution  should 
be  restarted  only  if  a  conflicting  operation  was  executed.  We  call  a  region  of  code 
that  is  protected  in  this  manner  an  interruptible  critical  section  (ICS).  Restarting 
a  critical  section  only  if  a  conflicting  operation  was  performed  improves  real-time 
schedulability,  because  a  low  priority  task  can  experience  restarts  only  from  higher 
priority  tasks  that  share  a  critical  section,  instead  of  from  all  higher  priority  tasks. 

We indicate an interruptible critical section by explicitly declaring it so.

interruptible_critical_section {
    stmt1;
    ...
    stmtn;
}

As  an  example,  we  can  implement  a  shared  stack  as  an  ICS  by  using  the  following 
code. 

struct stack_elem {
    data item;
    struct stack_elem *next;
} *sp;

push(elem)
struct stack_elem *elem;
{
    interruptible_critical_section {
        elem->next = sp;    // private write: elem is not yet shared
        sp = elem;          // the single globally visible write
    }
}

struct stack_elem *pop()
{
    struct stack_elem *temp;
    interruptible_critical_section {
        temp = sp;
        if (sp != NULL)
            sp = sp->next;
    }
    return (temp);
}

4.3    Implementing  Interruptible  Critical  Sections 
4.3.1  Background 

The  techniques  used  to  write  interruptible  critical  sections  are  based  on  the  ideas 
developed  for  non-locking  concurrent  data  structures.  Herlihy  [35]  introduces  the  idea 
of  non-blocking  concurrent  objects.  An  algorithm  for  a  non-blocking  object  provides 
the guarantee that one of the processes that accesses the object makes progress in
a finite number of steps. Herlihy provides a method for implementing non-blocking
objects  that  swaps  in  the  new  value  of  the  object  in  a  single  write.  Our  methods  are 
similar  to  an  extension  of  Herlihy's  work  proposed  by  Turek,  Shasha,  and  Prakash 
[97]. 

In the context of real-time synchronization, non-blocking shared objects are
desirable because a high priority task will not be blocked by a low priority task. In a
uniprocessor system, only one process at a time will access the shared data structures.
We can take advantage of the serial but interruptible access to simplify the
specification of the existing non-blocking techniques, and to improve on their efficiency.

In  an  interruptible  critical  section,  a  process  can  perform  only  one  write  that  is 
visible  to  other  processes.  Furthermore,  the  globally  visible  write  must  be  the  last 
instruction  in  the  protected  region.  Therefore,  a  process  that  is  executing  an  ICS 
records  its  updates  in  a  private  buffer  (the  commit  buffer).  The  final  write  commits 
the  updates  that  are  recorded  in  the  buffer  by  setting  a  commit  flag.  Any  subsequent 
process  that  executes  the  ICS  performs  the  updates  and  clears  the  commit  flag. 

This approach to optimistic synchronization is discussed by Alemany and Felten
[2] and by Bershad [10]. In this chapter, we discuss the following implementation
details that do not appear in the previous work.

•  Efficient  implementation  in  a  uniprocessor  system. 

•  How  to  perform  the  bulk  of  the  ICS  processing  outside  of  the  kernel. 

•  How  to  share  commit  buffers  among  processes. 

•  How  to  use  Herlihy's  small-object  protocol  [35]  to  minimize  the  number  of  writes 
that  must  be  placed  in  the  commit  buffer. 

• How to apply optimistic synchronization to real-time systems.

• An analysis of interruptible locks in a system of periodic tasks.

4.3.2  Implementation

In  the  following  discussion,  we  assume  that  if  a  process  experiences  a  context 
switch  while  executing  an  ICS,  the  process  re-executes  from  the  start  of  the  ICS  when 
it regains control of the CPU (as in [11]). In Section 4.4, we discuss the modification
necessary to permit re-execution only when a conflicting operation commits. The
modification is minor, but presenting the fully general algorithm here would obscure
the current discussion.

In [97], Turek et al. propose a method for transforming locking data structures
into non-blocking data structures. The key to the transformation is to post a
continuation instead of a lock. The continuation contains the modifications that the process
intends to perform. If a process attempts to post a continuation but is blocked
(because a continuation is already posted), the 'blocked' process performs the actions
listed in the continuation, removes the continuation, then re-attempts to post its own
continuation. As a result, a blocked process can unblock itself.

Although  Turek's  approach  simplifies  the  process  of  writing  a  critical  section,  a 
direct  translation  of  Turek's  algorithm  can  require  a  high  priority  process  to  perform 
the  work  of  many  low  priority  processes  that  have  posted  but  not  yet  performed  their 
actions.  An  easy  modification  of  Turek's  approach  results  in  a  simple  algorithm  which 
guarantees  that  a  high  priority  process  does  the  work  for  at  most  one  low  priority 
process. We present an ICS algorithm based on this approach here. We note
that  one  can  write  an  ICS  by  a  rather  different  approach,  the  details  of  which  are 
contained  in  [44]. 

Every shared concurrent object has a single commit record, and a flag indicating
whether the commit record is valid or invalid. When a process starts executing a
critical section, it checks to see if a previous operation left an unexecuted commit
record (the flag is valid). If so, the process executes the writes indicated by the
commit record, then sets the flag to invalid. The process then performs its
operation, recording all intended writes in the commit record. For the decisive instruction,
the process sets the flag to valid. A typical critical section has the following form.

struct commit_record_element {
    word *lhs, rhs;
} commit_record[MAX];
boolean valid;

critical_section()
{
    interruptible_critical_section {
        // Cleanup: apply the writes left by the previous operation
        if (valid) {
            instruction = 0;
            while (instruction < MAX and
                   commit_record[instruction].lhs != NULL) {
                *(commit_record[instruction].lhs) =
                    commit_record[instruction].rhs;
                instruction++;
            }
            valid = FALSE;
        }
        calculate modifications
        load modifications into commit_record
        valid = TRUE;    // decisive instruction
    }
}


For  example,  the  following  code  inserts  a  record  in  a  doubly  linked  list.  Other  list 
operations  are  similar. 

struct list_elem {
    data item;
    struct list_elem *forward, *backward;
} *head;

struct commit_record_element {
    word *lhs, rhs;
} commit_record[2];
boolean valid;

insert(elem)
list_elem *elem;
{
    list_elem *prev, *next;
    interruptible_critical_section {
        // Cleanup: apply the writes left by the previous operation
        if (valid) {
            instruction = 0;
            while (instruction < 2 and
                   commit_record[instruction].lhs != NULL) {
                *(commit_record[instruction].lhs) =
                    commit_record[instruction].rhs;
                instruction++;
            }
            valid = FALSE;
        }

        prev = NULL; next = head;
        while (not found_position(next)) {
            prev = next; next = next->forward;
        }
        // Found the insertion point
        elem->forward = next; elem->backward = prev;
        if (prev == NULL)
            commit_record[0].lhs = &head;
        else
            commit_record[0].lhs = &(prev->forward);
        commit_record[0].rhs = elem;
        if (next != NULL) {
            commit_record[1].lhs = &(next->backward);
            commit_record[1].rhs = elem;
        } else
            commit_record[1].lhs = NULL;
        valid = TRUE;
    }
}
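
To make the deferred writes concrete, the following plain C sketch mirrors the insert operation for a list sorted by an integer key. It is a sketch under stated assumptions, not the implementation used in this work: the names struct list, list_cleanup and list_insert, the integer item field, and the ordering test that stands in for found_position are introduced here for illustration, and the restart-on-conflict behaviour of the interruptible critical section itself is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct list_elem {
    int item;                           /* assumed integer key */
    struct list_elem *forward, *backward;
};

/* One deferred write: *lhs = rhs. A NULL lhs ends the record early. */
struct commit_elem {
    struct list_elem **lhs;
    struct list_elem *rhs;
};

struct list {
    struct list_elem *head;
    struct commit_elem commit_record[2];
    bool valid;        /* true: commit_record holds unapplied writes */
};

/* Cleanup phase: apply the writes left by the previous operation, if any. */
void list_cleanup(struct list *l)
{
    assert(l != NULL);
    if (!l->valid)
        return;
    for (int i = 0; i < 2 && l->commit_record[i].lhs != NULL; i++)
        *l->commit_record[i].lhs = l->commit_record[i].rhs;
    l->valid = false;
}

/* Body of the ICS for a sorted insert; the last statement is the commit. */
void list_insert(struct list *l, struct list_elem *elem)
{
    assert(l != NULL && elem != NULL);
    list_cleanup(l);

    struct list_elem *prev = NULL, *next = l->head;
    while (next != NULL && next->item < elem->item) {
        prev = next;
        next = next->forward;
    }
    /* Private writes: elem is not yet reachable from the list. */
    elem->forward = next;
    elem->backward = prev;

    l->commit_record[0].lhs = (prev == NULL) ? &l->head : &prev->forward;
    l->commit_record[0].rhs = elem;
    if (next != NULL) {
        l->commit_record[1].lhs = &next->backward;
        l->commit_record[1].rhs = elem;
    } else {
        l->commit_record[1].lhs = NULL;
    }
    l->valid = true;   /* decisive instruction */
}
```

Note that after list_insert returns, the list pointers are still unchanged; the new element becomes reachable only when the next operation (or an explicit call to list_cleanup) applies the commit record. Because every recorded write is an absolute assignment, re-running the cleanup after an interruption is harmless.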

The transformation from a blocking-based critical section to an ICS is
straightforward. The cleanup phase is inserted in the beginning of the critical section. Whenever
a write is performed into global data in the blocking-based critical section, the write
is recorded in the commit record in the ICS. The last statement of the ICS is to set
valid to TRUE. If operations perform few writes, then a high priority task performs
at most a few instructions on behalf of a low priority task. Further, the costs
balance because the high priority task leaves the commit record for a different task to
execute. In a blocking-based approach, the high priority task would incur a context
switch, thus costing the context switch overhead and also overhead due to cache line
invalidations.

4.3.3    Reducing  the  Clean-up 

If  the  critical  section  requires  a  small  modification  (or  can  be  broken  into  several 
sections,  each  requiring  only  a  small  modification),  then  the  basic  approach  allows 
a  low  priority  operation  to  block  a  high  priority  operation  for  only  a  short  period. 
If  an  operation  performs  a  substantial  modification  and  the  number  of  modifications 
that  an  operation  commits  might  vary  widely,  then  a  high  priority  operation  might 
spend  a  substantial  amount  of  time  performing  a  low  priority  operation's  updates  to 
the  data  structure. 

In  [33],  Herlihy  proposes  a  'shadow-page'  method  for  implementing  a  non-blocking 
concurrent  data  structure.  An  operation  calculates  its  modifications  to  the  data 
structure in a set of privately allocated (shadow) records, then links its records into
the  data  structure  in  its  decisive  instruction.  The  process  is  illustrated  in  Figure  4.1. 
The  blocks  in  the  data  structure  marked  'g'  are  replaced  by  the  shadow  blocks.  An 
operation  performs  its  decisive  instruction  by  swapping  the  anchor  pointer  from  the 
current  root  to  the  shadow  root.  The  blocks  that  are  removed  from  the  data  structure 
are  garbage  collected  by  the  successful  operation  and  are  (eventually)  made  available 
to  other  operations.  We  note  that  the  decisive  instruction  always  must  be  to  swap 
the  anchor,  in  order  to  ensure  serializability  in  a  parallel  system. 

The  most  complicated  part  of  Herlihy's  protocol  is  managing  the  garbage-collected 
records. The protocols are complex, and require O(P^2) space, where P is the number
of  processes  that  access  the  shared  object.  We  can  take  advantage  of  the  serial  access 
to  the  data  structure  in  the  ICS  to  simplify  the  implementation  and  reduce  the  space 
overhead.

Figure 4.1. Herlihy's non-blocking data structures


The  process  of  implementing  a  shadow-page  ICS  is  illustrated  in  Figure  4.2.  A 
process  obtains  the  records  it  needs  to  prepare  its  modifications  from  a  global  stack 
of  records.  The  global  record  stack  provides  the  records  for  all  operations  that  use 
records  of  the  size  it  stores.  When  a  process  obtains  a  record  from  the  global  stack, 
it  does  not  remove  the  record  from  the  stack.  Instead,  the  modifications  are  made 
to  records  while  they  are  still  on  the  stack.  A  local  variable,  current,  keeps  track 
of  the  last  allocated  record  from  the  record  stack.  Another  pair  of  local  variables, 
g_head  and  g_tail,  keep  track  of  the  records  to  be  removed  from  the  data  structure. 
To  commit  the  modification  to  the  data  structure,  the  operation  must  remove  the 
records  it  used  from  the  stack  of  global  records,  add  the  garbage  records  to  the  global 
stack,  and  adjust  a  pointer  in  the  data  structure.  These  three  modifications  can  be 
performed  using  a  regular  commit  record. 

Before listing the procedures to implement the shadow-page ICS, we note a couple
of details. First, every record in the data structure must contain enough additional
space to thread a list through it, whether the garbage list or the global record stack.
Second, the critical instruction of the operation is to declare that the commit record
is valid. As a result, the commit record can contain instructions to change any links
in the data structure. As an example, in Figure 4.2, a link from the root instead of
the anchor is modified.

Figure 4.2. Shadow-page ICS

We  assume  that  every  record  has  a  field  next  that  is  used  to  thread  the  global 
record  and  the  garbage  lists  through  the  nodes.  The  procedure  for  acquiring  a  new 
record  is 

record *getbuf(record **current)
{
    record *temp;

    temp = *current;
    *current = (*current)->next;
    return (temp);
}

The  procedure  to  declare  that  a  node  is  garbage  is  given  by 

garbage(record *elem, record **g_head, record **g_tail)
{
    if (*g_tail == NULL)
        *g_tail = elem;
    elem->next = *g_head;
    *g_head = elem;
}

A typical critical section is given by

struct commit_record_element {
    word *lhs, rhs;
} commit_record[3];
boolean valid;
Global record *pool;

critical_section()
{
    record *current, *g_head, *g_tail;
    restartable {
        // Cleanup: apply the writes left by the previous operation
        if (valid) {
            instruction = 0;
            while (instruction < 3 and
                   commit_record[instruction].lhs != NULL) {
                *(commit_record[instruction].lhs) =
                    commit_record[instruction].rhs;
                instruction++;
            }
            valid = FALSE;
        }

        // Initialize the list pointers
        current = pool;
        g_head = g_tail = NULL;

        Compute the modifications to the data structure
        using the getbuf and garbage procedures

        // Prepare the commit record
        commit_record[0].lhs = &(g_tail->next);
        commit_record[0].rhs = current;
        commit_record[1].lhs = &pool;
        commit_record[1].rhs = g_head;
        commit_record[2].lhs = critical_link;
        commit_record[2].rhs = critical_link_value;

        valid = TRUE;    // commit your update
    }
}
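
The following C sketch works through the three-write commit on the smallest possible case: a data structure consisting of a single record reached through an anchor pointer, updated by replacing that record with a shadow copy. It is an illustrative sketch only. The names shadow_obj, shadow_cleanup and shadow_set are hypothetical, the record holds just an integer, the anchor plays the role of the critical link, and the restart behaviour of the restartable region is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct record {
    int value;
    struct record *next;     /* threads the global pool and the garbage list */
};

struct commit_elem { struct record **lhs; struct record *rhs; };

struct shadow_obj {
    struct record *anchor;   /* the (one-record) data structure */
    struct record *pool;     /* global stack of free records */
    struct commit_elem cr[3];
    bool valid;
};

/* Take the next free record without unlinking it from the pool. */
static struct record *getbuf(struct record **current)
{
    struct record *temp = *current;
    assert(temp != NULL);    /* the pool must cover the largest update */
    *current = temp->next;
    return temp;
}

/* Add a record that leaves the data structure to the private garbage list. */
static void garbage(struct record *elem, struct record **g_head,
                    struct record **g_tail)
{
    if (*g_tail == NULL)
        *g_tail = elem;
    elem->next = *g_head;
    *g_head = elem;
}

/* Cleanup phase: apply the writes of the previously committed operation. */
void shadow_cleanup(struct shadow_obj *o)
{
    assert(o != NULL);
    if (!o->valid)
        return;
    for (int i = 0; i < 3 && o->cr[i].lhs != NULL; i++)
        *o->cr[i].lhs = o->cr[i].rhs;
    o->valid = false;
}

/* Replace the anchored record by a shadow record holding new_value. */
void shadow_set(struct shadow_obj *o, int new_value)
{
    shadow_cleanup(o);
    assert(o->anchor != NULL);
    struct record *current = o->pool, *g_head = NULL, *g_tail = NULL;

    struct record *shadow = getbuf(&current);  /* still on the pool */
    shadow->value = new_value;
    garbage(o->anchor, &g_head, &g_tail);

    o->cr[0].lhs = &g_tail->next; o->cr[0].rhs = current; /* garbage -> rest of pool */
    o->cr[1].lhs = &o->pool;      o->cr[1].rhs = g_head;  /* pool starts at garbage */
    o->cr[2].lhs = &o->anchor;    o->cr[2].rhs = shadow;  /* the critical link */
    o->valid = true;                                      /* commit */
}
```

After the commit record is applied, the old record sits at the top of the pool followed by the records that were not used, so the number of free records stays constant for this update.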


The shadow-page ICS requires that a high priority operation perform at most
three writes on behalf of a low priority operation when the shared data structure
is a tree. Arbitrary graph structures might require more updates, but the technique
has a similar application. Since a high-priority operation does not perform its own
clean-up, the costs balance, and again the high priority task avoids the context switch
overhead. The space requirements for a shadow-page ICS are independent of the
number of competing processes: the global pool must be initialized with enough
records to allow the data structure to reach its maximum size, plus the number
of records in the largest modification. Furthermore, the global pool can be shared
among several data structures (in which case they must share a commit record). The
linked list that is threaded through the data structure imposes an O(1) penalty on
every node in the data structure.

4.4    System  Support 

If an interruptible critical section is to be efficient, then a process executing one
should be restarted only if a conflicting operation occurs. Thus, information about
critical section executions must be transmitted to the kernel. In this section, we
describe a simple and efficient method of providing kernel-level support for interruptible
critical sections.

An  operating  system  must  have  a  small  context  switch  overhead  to  achieve  good 
performance.  Thus,  the  context-switch  time  support  for  an  ICS  must  be  limited  to  a 
minimum.  However,  we  would  like  to  make  the  mechanism  as  flexible  as  possible.  In 
addition,  we  would  like  to  avoid  making  kernel  traps  to  set  up  a  request  for  critical 
section  entry. 

To be efficient, information about conflicting executions must be passively
transmitted to the kernel. With every critical section, we associate an execution count,
cs_count. When a process enters a critical section, it reads cs_count into a local
variable, process_count. When the process completes the critical section, it
increments cs_count. Thus, the kernel can detect that a conflicting operation occurred
when the process_count of the switched-in process is different from the cs_count
of the critical section being executed. We use this mechanism as the basis of our
context switch support for the ICS.

Every critical section has a control block with the following information.

1. The starting and ending location of the critical section code.

2. The cs_count.

3.  Additional  structures  necessary  to  implement  interruptible  critical  sections. 

Every  process  that  executes  interruptible  critical  sections  has  a  block  of  memory 
in  user/kernel  space  that  contains  the  following  variables. 

1.  A  flag  that  is  set  if  the  process  is  executing  an  interruptible  critical  section. 

2.  A  pointer  to  the  critical  section  control  block. 

3.  process_count. 

On  a  context  switch,  the  kernel  executes  the  following  code  before  giving  control 
to  the  switched-in  process. 

If the next process to run is executing an ICS
    Find the critical section control block
    If the program counter of the next process to run
       is inside the ICS
        If cs_count != process_count
            Set the program counter of the next process to run
            to the start of the ICS.
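
The context-switch check above can be written as a small C function. This is a sketch under assumptions, not kernel code from our implementation: the structure and field names (cs_control_block, ics_process_block, in_ics) and the function ics_resume_pc are hypothetical, addresses are modeled as integers, and the code range is taken to be half-open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Per critical section: code range and execution count. */
struct cs_control_block {
    uintptr_t start, end;    /* code range of the ICS, taken as [start, end) */
    unsigned cs_count;       /* number of committed executions */
};

/* Per process: the three variables listed in the text. */
struct ics_process_block {
    bool in_ics;                         /* set while executing an ICS */
    const struct cs_control_block *cs;   /* which ICS */
    unsigned process_count;              /* cs_count as read at entry */
};

/* Returns the program counter at which the switched-in process resumes. */
uintptr_t ics_resume_pc(const struct ics_process_block *p, uintptr_t saved_pc)
{
    assert(p != NULL);
    if (!p->in_ics)
        return saved_pc;
    const struct cs_control_block *cs = p->cs;
    assert(cs != NULL);
    if (saved_pc < cs->start || saved_pc >= cs->end)
        return saved_pc;     /* flag set, but not inside the protected code */
    if (cs->cs_count != p->process_count)
        return cs->start;    /* a conflicting operation committed: restart */
    return saved_pc;         /* no conflict: continue where it left off */
}
```

The check costs a few comparisons per context switch, and a process that is not executing an ICS pays only for the test of the flag.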

To take advantage of the kernel mechanism, the process loads cs_count into
process_count before entering the ICS, and increments cs_count before exiting the
ICS. The following is a first attempt at writing the entry and exit code for an ICS.

// Entry code
Make the process' control block point to the
    critical section control block.
Set the flag in the process' control block.
process_count = cs_count
Begin_ICS:    // start of the ICS

// Exit code
End_ICS:  cs_count++
Reset the flag in the process' control block.

The problem with the above entry and exit code is that it doesn't cooperate
with the code that implements the interruptible critical section. The ICS expects
that the last instruction in the restartable region sets valid to TRUE. If valid is
set before cs_count is incremented, then an incorrect execution can result (either
an operation is executed twice, or a committed operation is ignored). If cs_count
is incremented before valid is set, then a process can cause itself to restart. We
can avoid these race conditions by having a single write that both commits the
operation and increments cs_count. With each critical section, we associate a second
execution counter, aux_count, that normally has the same value as cs_count. The last
instruction in the restartable region increments cs_count. A process can detect that
an operation has recently committed by testing aux_count and cs_count for equality.
If they are different, the process performs the writes of the previous operation. The
process signals that all of the updates are performed by setting aux_count=cs_count.
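
A minimal C sketch of this two-counter scheme follows, with a single commit record and integer-valued writes. The names ics_object, ics_apply_pending and ics_store are assumptions made for the example, and the kernel's restart of an interrupted process is not modeled; the point is only that the increment of cs_count is the one write that commits.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WRITES 2

/* One deferred write: *lhs = rhs. A NULL lhs ends the record early. */
struct commit_elem { int *lhs; int rhs; };

struct ics_object {
    unsigned cs_count;    /* incremented by the single committing write */
    unsigned aux_count;   /* catches up once the committed writes are applied */
    struct commit_elem commit_record[MAX_WRITES];
};

/* First step of every operation: apply a committed but unapplied update. */
void ics_apply_pending(struct ics_object *o)
{
    assert(o != NULL);
    if (o->aux_count == o->cs_count)
        return;
    for (size_t i = 0; i < MAX_WRITES && o->commit_record[i].lhs != NULL; i++)
        *o->commit_record[i].lhs = o->commit_record[i].rhs;
    o->aux_count = o->cs_count;
}

/* Example operation: record "*target = v", then commit with one write. */
void ics_store(struct ics_object *o, int *target, int v)
{
    assert(o != NULL && target != NULL);
    ics_apply_pending(o);
    o->commit_record[0].lhs = target;
    o->commit_record[0].rhs = v;
    o->commit_record[1].lhs = NULL;
    o->cs_count++;        /* last instruction: commits and informs the kernel */
}
```

Between the commit and the next operation, cs_count is one ahead of aux_count, which is exactly the condition that triggers the deferred writes.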

There  is  one  remaining  problem.  When  two  operations  execute  concurrently, 
they  interfere  when  they  record  their  writes  in  the  commit  record.  If  the  system 
uses  strict  priority  scheduling,  a  high  priority  operation  will  overwrite  the  concurrent 
lower  priority  process'  updates  to  the  commit  record,  then  force  the  lower  priority 
process  to  restart.  If  the  executions  of  the  two  operations  can  be  interleaved,  then 
they must have their own commit records to record their updates. But then, when
an operation commits it must indicate which commit record contains the update. This
is done by incrementing cs_count by the commit record index when committing.
The new exit code is

// Exit code

End_ICS:  cs_count += process_number

          reset the flag in the process' control block.

The code in the ICS to detect and perform a committed operation's updates is

    index = cs_count - aux_count
    if (index != 0)
        instruction = 0
        while (instruction < MAX and
               commit_record[instruction][index].lhs != NULL)
            *(commit_record[instruction][index].lhs) =
                commit_record[instruction][index].rhs
            instruction = instruction + 1
        aux_count = cs_count

As an optimization, we note that an operation that queries the data and performs
no updates does not need to force other operations to restart. Queries can be
implemented using the same methods as for updating operations, except that queries do
not  modify  cs_count. 
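
The commit-and-redo scheme described in this section can be sketched in C as follows. This is a minimal single-threaded illustration, not the implementation used in the experiments: the names record_write, commit, and finish_pending, and the limits MAX_WRITES and MAX_PROCS, are assumptions introduced here. It relies on the deferred writes being plain assignments, so that repeating them after a restart is harmless.

```c
#include <stddef.h>

#define MAX_WRITES 4   /* assumed: maximum deferred writes per operation */
#define MAX_PROCS  4   /* assumed: commit record indices 1..MAX_PROCS-1 */

struct write_rec { int *lhs; int rhs; };   /* one deferred write: *lhs = rhs */

static struct write_rec commit_record[MAX_WRITES][MAX_PROCS];
static int cs_count;    /* advanced by the single committing write */
static int aux_count;   /* catches up once the deferred writes are applied */

/* Apply the writes of a committed but unfinished operation, if any. */
static void finish_pending(void)
{
    int index = cs_count - aux_count;   /* 0 means nothing is pending */
    if (index != 0) {
        for (int i = 0;
             i < MAX_WRITES && commit_record[i][index].lhs != NULL; i++)
            *commit_record[i][index].lhs = commit_record[i][index].rhs;
        aux_count = cs_count;           /* signal that all updates are done */
    }
}

/* Record one deferred write in commit record `proc`, slot `slot`. */
static void record_write(int proc, int slot, int *lhs, int rhs)
{
    commit_record[slot][proc].lhs = lhs;
    commit_record[slot][proc].rhs = rhs;
    if (slot + 1 < MAX_WRITES)
        commit_record[slot + 1][proc].lhs = NULL;   /* terminate the list */
}

/* Commit point: one write both commits and names the commit record. */
static void commit(int proc)
{
    cs_count += proc;
}
```

An updating operation would call finish_pending on entry, record its writes with record_write, and end its restartable region with commit. A query would call finish_pending and read the data, but never call commit.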

4.5    Interruptible  Locks 

Interruptible critical sections let high priority operations execute quickly at the
expense of making low priority operations execute slowly. If too many tasks are
allowed to enter a critical section without blocking, low priority tasks might experience
an excessive number of restarts, increasing their response time and decreasing the
schedulability of the task set. We can reduce the unpredictability of the low priority
operations by letting only the highest priority tasks execute the critical section
without locking. Moderate to low priority tasks must acquire a semaphore to execute
in the critical section. As our results section shows, this greatly improves the
predictability of a set of real-time tasks. Furthermore, tasks that must acquire a
semaphore can be required to use the priority ceiling protocol. Our analysis section
shows that a combination of interruptible locks and the priority ceiling protocol
improves the schedulability of the low priority tasks.

The entry and exit code is changed to the following (we assume here that lower
priority numbers mean lower priorities).

// Entry code

            Make the process' control block point to the
                critical section control block.
            Set the flag in the process' control block.
            process_count = cs_count
            if (my_priority < cutoff_priority)
                P(S)

Begin_ICS:  // start of the ICS

// Exit code

End_ICS:    cs_count += process_number
            if (my_priority < cutoff_priority)
                V(S)
            reset the flag in the process' control block.

Interruptible locks also reduce the space requirements for an ICS with multitasking
processes. Since the processes which set a lock will not execute concurrently, they
can  share  a  commit  record.  In  a  typical  use  of  interruptible  locks,  only  one  very  high 
priority  process  will  be  able  to  interrupt  the  lock,  so  only  two  commit  records  are 
needed. 
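
The space saving described above can be illustrated with a short C sketch that assigns commit record indices. The function names must_lock and assign_commit_records are hypothetical, and the rule that index 0 stays unused is an assumption that follows from cs_count - aux_count == 0 meaning "nothing pending".

```c
#include <stddef.h>

/* Nonzero if a task of this priority must take the semaphore.
   Lower numbers mean lower priorities, as in the entry code. */
static int must_lock(int priority, int cutoff_priority)
{
    return priority < cutoff_priority;
}

/* Assign commit record indices. All locking tasks share one record,
   because they never execute the critical section concurrently; each
   non-locking task gets a record of its own. record_index[i] receives
   the index for task i. Returns the number of records required. */
static int assign_commit_records(const int *priority, size_t n,
                                 int cutoff_priority, int *record_index)
{
    int next = 1;      /* index 0 is reserved for "nothing pending" */
    int shared = 0;    /* shared record index, 0 until first allocated */

    for (size_t i = 0; i < n; i++) {
        if (must_lock(priority[i], cutoff_priority)) {
            if (shared == 0)
                shared = next++;
            record_index[i] = shared;
        } else {
            record_index[i] = next++;
        }
    }
    return next - 1;
}
```

With three locking tasks and one very high priority task above the cutoff, the function reports two commit records, matching the typical case described in the text.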

4.6  Implementation 

We implemented ICS support in a VMEexec [76] system development environment
with a pSOS+ [75] real-time, multi-tasking operating system kernel. The VMEexec
system consists of a host running on a VMEmodule-driven SYSTEM V/68 operating
system and a set of VMEmodule target processors running the pSOS+ kernel. In our
configuration, we have six MVME147 VMEmodules based on the Motorola MC68030,
with 4 MB of shared memory on each module. One VMEmodule is used as a host
processor running SYSTEM V/68, and the rest are real-time target processors
running the pSOS+ kernel. For the experiments described in this chapter, we made
use of only one of the target processors.

pSOS+ is a real-time, multi-tasking kernel that supports multiprocessors. It
provides a rich set of system services including task management, shared-memory
regions, synchronous and asynchronous signals, semaphores, and messages. One
particular feature of pSOS+ is its support for user-written routines that can be called
at the start of a task, during a context switch, and at the end of a task. This feature
allows us to implement ICS support without modifying the kernel.

We  use  two  data  structures  to  implement  the  ICS:  one  for  the  critical  section  and 
one  for  each  task  that  uses  the  critical  section.  The  global  lock  structure  consists 
of  a  critical  section  identifier,  a  counter  that  tracks  the  number  of  times  the  critical 
section  has  been  executed,  and  the  critical  section  bounds. 

struct ICS_Lstruct {
    int   id;        /* ID of this critical section */
    int   cs_count;  /* Global Execution Count */
    char *cslow;     /* CS Low Address */
    char *cshigh;    /* CS High Address */
};

The structure local to a task consists of the copy of the ICS execution count, a
count of the number of times the critical section is retried on any invocation (for
statistics), a pointer to the ICS_Lstruct, and a flag to indicate that the task is entering
the critical section.

struct ICS_Tstruct {
    int                 process_count;  /* Local Execution Count */
    struct ICS_Lstruct *ilp;            /* Interruptible Lock Record Pointer */
    int                 icount;         /* Interrupt Count of a task */
    int                 flag;           /* Flag = ID => In CS; = 0 => Not */
};

The ICS implementation code consists of two parts: the ICSctxsw routine, which
provides the ICS lock mechanism, and the ICSclient task, which uses the ICS
mechanism. We have already discussed the algorithms used by the ICSclient task in
Section 4.4.

4.6.1    ICSctxsw  Routine 

The ICSctxsw routine is integrated with the pSOS+ kernel as a user-written
routine that is called during a context switch. The call occurs at the point where the
context of the switched-out task has been completely saved, and before the context
of the switched-in task is loaded. pSOS+ provides the addresses of the Task Control
Blocks (TCBs) of both the switched-in task and the switched-out task in machine
registers. The TCB contains all the context of a task, including the Program Counter
(PC). ICSctxsw can reset the PC in the TCB of a switched-in task, if required.
pSOS+ provides a set of eight software-defined user registers (USRs) that a task can
access in the TCB. User register 0, U_REG0, is used to contain the address of
the ICS_Tstruct of a task using ICS.

ICSctxsw()
{
    struct tcb         *in_tcb;
    struct ICS_Tstruct *tlp;

    load in_tcb from the machine register;
    tlp = Get U_REG0 from in_tcb;
    if (tlp != NULL && tlp->flag == LOCKID)
    {
        if (in_tcb->currpc >= tlp->ilp->cslow &&
            in_tcb->currpc <= tlp->ilp->cshigh)
        {
            if (tlp->process_count != tlp->ilp->cs_count)
                in_tcb->currpc = tlp->ilp->cslow;
        }
    }
}

ICSctxsw checks whether the program counter (PC) of the task about to be run is within
the critical section region. If so, it tests the restart criterion: whether the execution
count of the critical section has changed since the task recorded its local copy. If the
criterion is met, the PC is reset to the beginning of the critical section and the task is
forced to re-execute
the  critical  section.  Otherwise,  the  task  is  allowed  to  continue. 
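
The decision made by ICSctxsw can be modeled in portable C as below. This is a sketch for experimentation outside the kernel, not the pSOS+ callout itself: struct tcb_model is a stand-in for the real TCB, addresses are plain integers, and incrementing icount on each restart is an assumption based on its description as a retry counter.

```c
#include <stddef.h>
#include <stdint.h>

struct ics_lock {                 /* models ICS_Lstruct */
    int       id;
    int       cs_count;
    uintptr_t cslow, cshigh;      /* bounds of the critical section code */
};

struct ics_task {                 /* models ICS_Tstruct */
    int              process_count;
    struct ics_lock *ilp;
    int              icount;
    int              flag;
};

struct tcb_model {                /* stand-in for the pSOS+ TCB */
    uintptr_t        currpc;      /* saved program counter */
    struct ics_task *u_reg0;      /* user register 0 */
};

/* Called for the task being switched in.
   Returns 1 if the saved PC was reset to the start of the ICS. */
static int ics_ctxsw(struct tcb_model *in_tcb)
{
    struct ics_task *tlp = in_tcb->u_reg0;

    if (tlp == NULL || tlp->ilp == NULL || tlp->flag != tlp->ilp->id)
        return 0;                              /* task is not using this ICS */
    if (in_tcb->currpc < tlp->ilp->cslow ||
        in_tcb->currpc > tlp->ilp->cshigh)
        return 0;                              /* PC is outside the ICS */
    if (tlp->process_count == tlp->ilp->cs_count)
        return 0;                              /* no conflicting commit */

    in_tcb->currpc = tlp->ilp->cslow;          /* force re-execution */
    tlp->icount++;                             /* assumed: retry statistics */
    return 1;
}
```

The three early returns correspond to the three nested tests in the routine above: the task is registered for the lock, its PC lies inside the critical section, and another operation has committed since the task recorded process_count.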

4.6.2    User-level  Entry  and  Exit 

The  ICS  entry  and  exit  code  that  is  used  in  conjunction  with  the  ICSctxsw  routine 
must  (in  general)  be  kernel  calls,  because  the  task  control  block  is  modified.  To 
permit  user-level  synchronization,  the  entry  and  exit  calls  must  be  designed  so  that 
bad  parameters  cannot  be  passed. 

Instead of being stored in arbitrary locations, the critical section ICS control blocks
(ICS_Lstruct) are stored in an array in kernel space. Registering an ICS requires a
call to fill in one of the control blocks. In the task ICS control block (ICS_Tstruct),
we store the index of the control block of the ICS that is being executed instead
of a pointer to it (or 0 if no ICS is being executed). In the context switch routine
(ICSctxsw), the index to the ICS control block is used in place of the reference. If
the number of allowed ICS control blocks is a power of 2, then bounds checking can
be done by masking out the high order bits of the index in the task ICS control block.
An  invalid  index  causes  no  problems,  since  the  PC  won't  be  in  the  specified  range. 
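
The masking step can be sketched as follows. The table size ICS_MAX_LOCKS and the function name ics_lookup are assumptions made for illustration; the point is only that any value a user task writes into its control block maps to some valid slot.

```c
#include <stdint.h>

#define ICS_MAX_LOCKS 8u   /* assumed table size; must be a power of 2 */

_Static_assert((ICS_MAX_LOCKS & (ICS_MAX_LOCKS - 1u)) == 0,
               "ICS_MAX_LOCKS must be a power of 2");

struct ics_lock {
    int       id;
    int       cs_count;
    uintptr_t cslow, cshigh;
};

/* Control blocks live in one array in kernel space. */
static struct ics_lock ics_table[ICS_MAX_LOCKS];

/* Map any user-supplied index to a valid slot by discarding the
   high-order bits; no comparison or branch is needed. */
static struct ics_lock *ics_lookup(unsigned index)
{
    return &ics_table[index & (ICS_MAX_LOCKS - 1u)];
}
```

A corrupted index therefore selects the wrong control block rather than an arbitrary kernel address, and the range test on the PC then fails harmlessly.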

4.7    Experimental Performance Results

We  tested  the  performance  of  interruptible  locks  on  a  shared  priority  queue.  There 
are  three  low  priority  enqueue  tasks  (of  equal  priority)  and  a  single  high  priority 
dequeue  task.  This  experiment  corresponds  to  several  computational  tasks  providing 
data  for  a  high  priority  I/O  task.  All  four  tasks  are  started  under  the  control  of  a 
low  priority  parent  task.  The  parent  and  the  tasks  communicate  through  message 
queues. 

We  compared  4  types  of  mechanisms. 

1.  Interruptible  Critical  Sections:  All  tasks  immediately  enter  the  ICS. 

2.  Interruptible  Locks:  The  enqueuing  tasks  set  and  release  a  semaphore,  the 
dequeuing  task  does  not. 

3.  Non-prioritized  Semaphore  Locks:  All  of  the  tasks  acquire  a  semaphore  before 
entering  the  critical  section.  The  semaphore  lock  is  granted  on  FCFS  basis. 

4.  Prioritized  Semaphore  Locks:  Same  as  the  above,  but  the  semaphore  is  granted 
on  a  priority  basis. 

Parameters. In the first experiment, each task performs 10,000 enqueue (dequeue)
operations, but we stop collecting statistics after any task completes. Each
enqueue task spins for 7 ticks (about 70 ms), then executes a 1 tick critical section.
The time quantum for a task is 2 ticks. The dequeue task sleeps for 10 ticks before
entering a 1 tick critical section.

We collect the time to execute a critical section, and we create histograms that
show how frequently the critical section execution takes a particular amount of
time. The performance of the non-prioritized semaphore is shown in Figure 4.3.
The dequeue operation sometimes experiences a long delay, and the time to execute
enqueue operations is moderate. Since pSOS+ offers prioritized semaphores, a fairer
comparison should use them. This data is shown in Figure 4.4. There is a slight
decrease in the dequeue and enqueue response times, but the dequeue operation
still experiences a long delay a few times. In Figure 4.5, we show the time to execute the
enqueue and dequeue critical sections using interruptible critical sections. The dequeue
operation is always performed without delay, and the enqueue operations perform as
well as when using the prioritized semaphores. In Figure 4.6, we use interruptible
locks. The performance is comparable to that of the interruptible critical sections.

Using  an  ICS  can  cause  poor  performance  among  low  priority  tasks  if  the  critical 
sections  have  a  high  utilization.  In  the  second  experiment,  the  enqueue  task  spins 
for  2  ticks  instead  of  7  ticks,  and  then  executes  a  4  tick  critical  section.  The  dequeue 
task  sleeps  for  20  ticks  and  executes  a  1  tick  critical  section.  These  parameters  are 
selected  to  exaggerate  the  conflicts  among  the  tasks,  to  show  the  restart  problems 
that  using  an  ICS  can  cause. 

Figure 4.7 shows the time to execute the enqueue and dequeue critical sections
using interruptible critical sections. We note that the scale on this chart is
nonlinear. The dequeue operation is again performed without delay, but the enqueue
operation  can  take  an  extremely  long  time  to  execute.  In  contrast,  Figure  4.8  shows 
the  usage  of  interruptible  locks  in  which  the  time  to  execute  an  enqueue  operation  is 
moderate. 

Comparing the interruptible lock and the prioritized semaphore implementations
for the critical sections, we find that the interruptible lock eliminates the delay in
executing the high priority critical section, while adding only a small delay (in this
case about 20%) to the time to execute a low priority critical section.

The significance of these experiments is not that the average response time of
the high priority dequeue operation is reduced, but that the response times of the
dequeue operations become predictable. In the low-conflict experiment, the dequeue
operation usually completes immediately, but on occasion requires 5 ticks when a
prioritized semaphore is used. This unpredictability might cause timing problems.
We note that the priority ceiling protocol would provide the same performance as
the prioritized semaphore does (since there are no other critical sections), but at
the cost of a more complex and expensive scheduler and synchronization mechanism.
Interruptible locks always allow the dequeue operation to finish immediately, even
under a very high load.

To test the overhead of using interruptible critical sections, we ran experiments to
time the overhead in the context switch code and in the ICS entry code. In the first
experiment, we enter and exit an (empty) critical section 10,000 times. We found that
acquiring and releasing a semaphore adds 57 ticks to the program execution time.
Entering and exiting an ICS requires 67 ticks if the entry and exit code is contained
in a system call, and 1 tick if the entry and exit code is user code. In the second
experiment,  we  force  10,000  context  switches.  We  find  that  the  unavoidable  context 
switches  by  themselves  require  58  ticks,  and  the  ICS  callout  code  adds  9  ticks  of 
overhead.  These  numbers  are  for  the  current  implementation  of  ICS  and  interruptible 
locks.  An  implementation  that  is  more  tightly  integrated  into  the  kernel  will  require 
less  overhead.  For  example,  if  the  context-switch  code  is  part  of  the  kernel,  then 
there  is  no  need  for  the  callout  routine  overhead  and  we  would  have  faster  access  to 
the  program  counter. 
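
To put these totals in per-operation terms, the following conversion assumes a tick of about 10 ms, which is inferred from the "7 ticks (about 70 ms)" figure in the experiment parameters and is not stated directly. Each total covers 10,000 repetitions, and the measurements have tick resolution, so the small values in particular are only rough.

```latex
\begin{align*}
\text{semaphore acquire and release} &\approx \frac{57 \times 10\ \text{ms}}{10{,}000} = 57\ \mu\text{s} \\
\text{ICS entry and exit (system call)} &\approx \frac{67 \times 10\ \text{ms}}{10{,}000} = 67\ \mu\text{s} \\
\text{ICS entry and exit (user code)} &\approx \frac{1 \times 10\ \text{ms}}{10{,}000} = 1\ \mu\text{s} \\
\text{context switch} &\approx \frac{58 \times 10\ \text{ms}}{10{,}000} = 58\ \mu\text{s} \\
\text{ICS callout per context switch} &\approx \frac{9 \times 10\ \text{ms}}{10{,}000} = 9\ \mu\text{s}
\end{align*}
```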

4.8  Analysis

Hard  real-time  systems  require  timing  guarantees,  and  for  this  reason  one  typically 
considers  periodic  task  sets.  In  this  section,  we  show  how  to  analyze  a  periodic  task 
set  that  uses  an  ICS  or  an  interruptible  lock  for  synchronization,  and  derive  worst-case 
response  times. 

The set of tasks is $\{\tau_i\}$. We use the convention that $\tau_i$ has a higher priority
than $\tau_j$ if $i < j$. Each task $\tau_i$ has a period $T_i$ and a worst-case execution time $C_i$.
An instantiation of task $\tau_i$ is released at the beginning of its period, and has for a
deadline the release of the next instantiation of the task. If all tasks can always
complete before their deadline, then the task set is feasible. A real-time system with
periodic tasks is typically scheduled using the Rate Monotonic algorithm, which gives
static preemptive priority to tasks with shorter periods. Rate Monotonic scheduling
is well studied, and the feasibility of a task set can be exactly characterized. Let $r_i$
be the worst-case response time of task $\tau_i$. If the tasks do not access critical sections,
then $r_i$ is the fixed point of the following recursive equation [48]:

$$r_i = C_i + \sum_{j<i} \left\lceil \frac{r_i}{T_j} \right\rceil C_j \qquad (4.1)$$

Unfortunately, it is not always realistic to assume that a task is released for
execution at the start of its period. For example, most static-priority task schedulers
are implemented using tick scheduling: a periodic clock interrupt polls the task set
and performs a context switch if a task with a higher priority than the current one is
ready for execution. The release of a task can also be blocked by external events, such
as the arrival of a message from a communicating task in a distributed system [57].
The task set might be subject to release jitter [94], possibly due to tick scheduling
or due to waiting for external events. Tindell, Burns, and Wellings show how to
modify equation 4.1 to account for release jitter. In each of these cases, the deadline
of the task can be less than the task period, perhaps significantly less. If the task
deadlines are shorter than the task periods, then the Deadline Monotonic algorithm
is the optimal static scheduler [58].

If the tasks can access critical sections and thus experience blocking, then the
maximum blocking time is added to the above $r_i$ value. If the Priority Ceiling Protocol
is  used,  a  task  will  block  for  at  most  the  duration  of  one  critical  section.  If  the 
tasks  use  interruptible  locks,  then  there  is  an  additional  re-execution  component  that 
must  be  added  to  the  response  times.  We  next  compute  the  time  wasted  due  to 
re-executions  of  critical  sections. 

We assume that interruptible locks are used in conjunction with the Priority
Ceiling Protocol. So, for each critical section, there is a (possibly empty) set of tasks
that enter the critical section without blocking, and another (possibly empty) set of
tasks that acquire a semaphore before entering the critical section.

The tasks are described by their periods $T_i$, their execution times (in the absence
of concurrency) $C_i$, the set of critical sections that they access $Z_i$, the execution times
in their critical sections $b_{i,z}$, and their deadlines $D_i$. Hence, we assume that task
priorities are assigned according to the length of $D_i$. Let $Z$ be the set of critical
sections ($Z = \bigcup_i Z_i$), and for $z \in Z$, let $\tau^z_1, \tau^z_2, \ldots, \tau^z_{n_z}$ be the set of tasks that access
$z$, in order of priority. Of the $n_z$ tasks, the $l_z$ highest priority tasks enter $z$ without
blocking, and the remaining $n_z - l_z$ acquire a semaphore using the Priority Ceiling
Protocol. We define $I(z)$ to be the highest numbered task that enters $z$ without
blocking (i.e., $\tau_{I(z)} = \tau^z_{l_z}$).

Let us first consider a couple of simple examples. Suppose that we have a set of
three tasks that each access the semaphore $z$. The characteristics of the tasks are
listed in Table 4.1. Note that this task set cannot be guaranteed to meet its deadlines
if the Priority Ceiling Protocol is used, as $C_1 + \max_i(b_{i,z}) > D_1$.

Table 4.1. Example task description 1 for ICS

task        $T_i$     $C_i$     $Z_i$    $b_{i,z}$    $D_i$
$\tau_1$    10 ms     2.5 ms    z        1 ms         3 ms
$\tau_2$    15        5         z        1            10
$\tau_3$    30        4         z        1            28

If the semaphore $z$ is protected by an ICS, then task $\tau_1$ can always meet its
deadline because $C_1 < D_1$. The question is whether the remaining tasks can meet
their deadlines. To analyze task $\tau_2$, we observe that every time that $\tau_1$ interrupts $\tau_2$,
the result can be a re-execution of $z$. So, to determine the response time $r_2$ of $\tau_2$, we
modify equation 4.1 to incorporate the re-execution time.

$$r_2 = C_2 + \lceil r_2/T_1 \rceil (C_1 + b_{2,z}) \qquad (4.2)$$

By solving equation 4.2, we find that $r_2 = 8.5 < D_2$. To determine $r_3$, we observe
that both $\tau_1$ and $\tau_2$ can cause a re-execution in $\tau_3$. However, an execution of $\tau_1$ can
either cause a re-execution in $\tau_2$ or a re-execution in $\tau_3$, but not both. In either case,
$r_3$ is increased by one execution of $z$. Therefore, the formula for $r_3$ is

$$r_3 = C_3 + \lceil r_3/T_1 \rceil (C_1 + b_{2,z}) + \lceil r_3/T_2 \rceil (C_2 + b_{3,z}) \qquad (4.3)$$

We find that $r_3 = 26.5 < D_3$, so $\tau_3$ can always meet its deadline. We conclude
that  the  task  set  in  Table  4.1  is  feasible  if  synchronization  is  done  with  an  ICS,  but 
is  infeasible  if  the  synchronization  is  done  using  the  Priority  Ceiling  Protocol. 
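
The values above follow from iterating the recurrences from the task's own execution time until the result stops changing. As a worked check with the Table 4.1 parameters ($C_1 + b_{2,z} = 3.5$, $C_2 + b_{3,z} = 6$, all times in ms), where the superscript counts iterations and is notation introduced here:

```latex
\begin{align*}
r_2^{(0)} &= 5, &
r_2^{(1)} &= 5 + \lceil 5/10 \rceil (3.5) = 8.5, &
r_2^{(2)} &= 5 + \lceil 8.5/10 \rceil (3.5) = 8.5 \\[4pt]
r_3^{(0)} &= 4 \\
r_3^{(1)} &= 4 + \lceil 4/10 \rceil (3.5) + \lceil 4/15 \rceil (6) = 13.5 \\
r_3^{(2)} &= 4 + \lceil 13.5/10 \rceil (3.5) + \lceil 13.5/15 \rceil (6) = 17 \\
r_3^{(3)} &= 4 + \lceil 17/10 \rceil (3.5) + \lceil 17/15 \rceil (6) = 23 \\
r_3^{(4)} &= 4 + \lceil 23/10 \rceil (3.5) + \lceil 23/15 \rceil (6) = 26.5 \\
r_3^{(5)} &= 4 + \lceil 26.5/10 \rceil (3.5) + \lceil 26.5/15 \rceil (6) = 26.5
\end{align*}
```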

We can observe a general method for computing worst-case response times of the
tasks, if all critical sections are protected by ICSs only. We define

$$b(z; j, i) = \begin{cases} \max_{j < k \le i} b_{k,z} & \text{if } \exists k \text{ such that } j < k \le i \text{ and } z \in Z_k \cap Z_j \\ 0 & \text{otherwise} \end{cases}$$

Thus, $b(z; j, i)$ is the longest critical section that can be interrupted by $\tau_j$ and
increase the response time of $\tau_i$.

Next, we need to determine the increase in the response time of task $\tau_i$ caused by
executions of $\tau_j$. If $\tau_j$ causes a re-execution of critical section $z$, then only one critical
section must be re-executed because of $\tau_j$. If $\tau_k$ and $\tau_i$ are executing $z$ when $\tau_j$
executes $z$ ($j < k < i$), then both $\tau_k$ and $\tau_i$ must re-execute $z$. However, $\tau_i$ would
need to re-execute anyway, because of $\tau_k$'s execution. Furthermore, all tasks with a
lower priority than $\tau_k$ have their response time increased by $\tau_k$'s re-execution time.
Therefore, the increase in $\tau_i$'s response time due to an execution of $\tau_j$ is

$$b_{tot}(j, i) = \max_{z \in Z} b(z; j, i)$$

Then, the response time of task $\tau_i$ is the solution of

$$r_i = C_i + \sum_{j<i} \lceil r_i/T_j \rceil \left( C_j + b_{tot}(j, i) \right) \qquad (4.4)$$

In  Table  4.2,  we  show  another  example  analysis  of  a  more  complex  task  set  that 
synchronizes  using  an  ICS  only.  We  observe  that  this  task  set  is  another  example 
that  cannot  be  guaranteed  to  be  feasible  if  PCP  is  used,  but  is  guaranteed  to  be 
feasible  when  an  ICS  is  used. 
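
Equation 4.4 can be evaluated mechanically by fixed-point iteration. The sketch below is one way to do so, under stated assumptions: times are held as integer microseconds to keep the arithmetic exact, a critical section time of 0 marks a section the task does not access, and the names btot, response_time, and table_4_2 are introduced here. The data are the parameters of Table 4.2.

```c
#include <stddef.h>

#define NSEC 2                  /* critical sections: index 0 = X, 1 = Y */

struct task {
    long T, C, D;               /* period, execution time, deadline (us) */
    long b[NSEC];               /* time in each section; 0 = not accessed */
};

/* Task set of Table 4.2, highest priority first. */
static const struct task table_4_2[] = {
    { 20000, 2500,  5500, { 1000,    0 } },
    { 20000, 2500,  5500, {    0, 1000 } },
    { 30000, 5000, 15000, { 1000,    0 } },
    { 40000, 4000, 25000, {    0, 1000 } },
    { 50000, 4000, 30000, { 1000, 1000 } },
};

static long ceil_div(long a, long b) { return (a + b - 1) / b; }

/* b_tot(j, i): the longest critical section, among tasks j+1..i, of a
   section that task j also accesses and can therefore interrupt. */
static long btot(const struct task *t, size_t j, size_t i)
{
    long best = 0;
    for (size_t z = 0; z < NSEC; z++) {
        if (t[j].b[z] == 0)
            continue;                       /* task j never commits in z */
        for (size_t k = j + 1; k <= i; k++)
            if (t[k].b[z] > best)
                best = t[k].b[z];
    }
    return best;
}

/* Fixed point of equation 4.4 for task i (0 = highest priority).
   Returns -1 if the iteration passes the deadline first. */
static long response_time(const struct task *t, size_t i)
{
    long r = t[i].C;
    for (;;) {
        long next = t[i].C;
        for (size_t j = 0; j < i; j++)
            next += ceil_div(r, t[j].T) * (t[j].C + btot(t, j, i));
        if (next == r)
            return r;
        if (next > t[i].D)
            return -1;
        r = next;
    }
}
```

Calling response_time(table_4_2, i) for each of the five tasks gives the response times listed in the last column of Table 4.2. The iteration terminates because each step can only increase the estimate and the deadline bounds it from above.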


Table 4.2. Example task description 2 for ICS

task        $T_i$     $C_i$     $Z_i$    $b_{i,z}$    $D_i$     $r_i$
$\tau_1$    20 ms     2.5 ms    X        1 ms         5.5 ms    2.5 ms
$\tau_2$    20        2.5       Y        1            5.5       5.0
$\tau_3$    30        5         X        1            15        11
$\tau_4$    40        4         Y        1            25        16
$\tau_5$    50        4         X,Y      1,1          30        29


To incorporate a mixed system that uses both the ICS and the PCP techniques, we
need to modify equation 4.4 to account for blocking and the restricted interruption.
In particular, $\tau_j$ can cause re-executions only if it does not block before executing
critical section $z$.

We define

$$bp(z; j, i) = \begin{cases} \max_{j < k \le i} b_{k,z} & \text{if } j \le I(z) \text{ and } \exists k \text{ such that } j < k \le i \text{ and } z \in Z_k \\ 0 & \text{otherwise} \end{cases}$$

$$bp_{tot}(j, i) = \max_{z \in Z} bp(z; j, i)$$

Since all tasks that access $z$ numbered larger than $I(z)$ use the Priority Ceiling
Protocol, we must calculate the worst-case execution times of critical sections that
can be locked, $BP(z)$. In particular, we must account for the possible re-executions
of $z$.

$$BP(z) = \begin{cases} \max \{ \lceil r_j/T_i \rceil \, b_{j,z} \mid z \in Z_i \cap Z_j,\ i \le I(z) < j \} & \text{if } \exists i, j \text{ such that } z \in Z_i \cap Z_j,\ i \le I(z) < j \\ \max \{ b_{j,z} \mid z \in Z_j \} & \text{if } I(z) = 0 \\ 0 & \text{otherwise} \end{cases}$$

$$B(i) = \max \{ BP(z) \mid z \text{ can block } \tau_i \}$$

Then, the response time of task $\tau_i$ is the solution of

$$r_i = C_i + \sum_{j<i} \lceil r_i/T_j \rceil \left( C_j + bp_{tot}(j, i) \right) + B(i) \qquad (4.5)$$

In Table 4.3, we show an example analysis of a task set. The column labeled
$r_i$ (ICS) shows the worst-case response times when interruptible critical sections are
used for synchronization, and the column labeled $r_i$ (ICS + PCP) shows the worst-case
response times when interruptible locks are used. For the interruptible locks,
we assume that tasks $\tau_1$ and $\tau_2$ never block, but tasks $\tau_3$ through $\tau_8$ acquire a lock
using the Priority Ceiling Protocol. We observe that the task set cannot meet its
deadlines if only the PCP is used. Furthermore, task $\tau_8$ cannot be guaranteed to meet
its deadline when only ICS is used. However, a combination of interruptible locks
and the priority ceiling protocol lets all tasks meet their deadlines. We observe that
interruptible locks penalize intermediate-priority tasks, due to the possible blocking
waits, $B(i)$. However, the response time of the low priority tasks is reduced because of
the reduced rate of critical section interruptions.


Table 4.3. Example task description 3 for ICS

task        $T_i$     $C_i$    $Z_i$    $b_{i,z}$    $D_i$     $r_i$ (ICS)    $r_i$ (ICS + PCP)
$\tau_1$    25 ms     3 ms     X        1 ms         6.5 ms    3 ms           3 ms
$\tau_2$    25        3        Y        1            6.5       6              6
$\tau_3$    30        3        X        1            15        10             12
$\tau_4$    30        3        Y        1            20        14             16
$\tau_5$    30        3        X        1            30        18             19
$\tau_6$    30        3        Y        1            30        22             22
$\tau_7$    100       3        X        1            80        49             45
$\tau_8$    100       3        Y        1            80        86             48

4.9  Conclusion 

We have presented methods for implementing interruptible critical sections (ICS),
and using them with interruptible locks. Interruptible critical sections use optimistic
concurrency control instead of pessimistic concurrency control. If a process that is
executing an ICS is interrupted and a conflicting operation commits, the conflicted
process restarts its execution from the beginning of the critical section. In a real-time
system, interruptible critical sections prevent priority inversion. In addition, the ICS
mechanism is independent of the scheduling algorithm. We show how several recent
ideas in non-blocking and uniprocessor synchronization can be synthesized to provide
low-overhead interruptible critical sections.

We show how an ICS can be implemented in practice, and discuss our ICS
implementation in the pSOS+ operating system. We find that the use of a prioritized
semaphore can lead to unpredictable execution times of high priority tasks, while
the use of an ICS allows the high priority task to always complete quickly.

The use of interruptible critical sections alone can cause too many critical section
re-executions, making low-priority tasks unschedulable. Interruptible critical sections
can be used with locks to create interruptible locks. We show that when an
interruptible lock is used, a low-priority task never blocks a high priority task, and the low
priority tasks experience only a small degradation in execution time. Interruptible
locks are appropriate when very time sensitive tasks must communicate with lower
priority tasks through shared-memory data structures.

We present an analysis of a hard real-time periodic task set that synchronizes
using interruptible critical sections. We show that if the highest priority tasks have
very tight deadlines, then interruptible critical sections can improve the schedulability
of the task set. Interruptible locks can be used in conjunction with the priority ceiling
protocol. We show that using interruptible locks with PCP can improve schedulability
over using ICS or PCP alone.

Figure 4.3. Response time distribution of the non-prioritized semaphore

Figure 4.4. Response time distribution of the prioritized semaphore

Figure 4.5. Response time distribution of the Interruptible Critical Section

Figure 4.6. Response time distribution of the Interruptible Lock

Figure 4.7. Response time distribution of the Interruptible Critical Section for high
lock utilization

Figure 4.8. Response time distribution of the Interruptible Lock for high lock
utilization


CHAPTER  5 

EXTENDING  INTERRUPTIBLE  CRITICAL  SECTIONS  TO  MULTIPROCESSORS 

5.1  Introduction 

Mutual exclusion is a significant problem for sharing a resource in a real-time
shared-memory multiprocessor system. Most lock-based synchronization mechanisms
can block a high priority task that requires the lock when a low priority task is holding
the lock, causing priority inversion. In this chapter, we present algorithms that
extend Interruptible Critical Sections to a shared-memory multiprocessor environment
(ICSM).

Takada  and  Sakamura  [91]  proposed  algorithms  that  extend  queuing  spin-locks  to 
be  preempted  for  servicing  interrupts.  They  address  the  conflicting  issue  of  servicing 
a  pending  interrupt  while  holding  a  lock.  Rajkumar  et  al.  extended  the  Priority 
Ceiling  Protocol  that  limits  the  priority  inversion  in  a  uniprocessor  to  multiprocessors 
[79].  Shu  et  al.  [84]  proposed  an  Abort  Ceiling  Protocol,  an  extension  to  the  Priority 
Ceiling  Protocol.  In  this  algorithm,  an  abort  ceiling  priority  is  associated  with  a  task. 
The  abort  ceiling  comes  into  effect  when  the  task  is  executing.  Another  task  may 
abort  the  currently  running  task  and  run  immediately  if  its  priority  is  higher  than 
the  current  abort  ceiling.  The  protocol  relies  on  the  Interruptible  Critical  Sections 
to  restart  the  critical  section  of  the  aborted  task.  Also,  the  protocol  assumes  static 
priorities. The Ceiling Abort Protocol [92] proposed by Takada and Sakamura is a
similar  extension  to  the  Priority  Ceiling  Protocol.  This  protocol  assigns  an  abort 
ceiling  priority  to  the  critical  section  instead.  Also,  the  critical  section  is  divided  into 
abortable  and  non-abortable  segments. 

McCann et al. [65] conclude that preempting processors in a coordinated way
is critical to response times while using critical sections. Anderson et al. [3] show
that a naive implementation of spin-locks can delay not only the processor waiting
for a lock, but also other processors doing work. They suggest an Ethernet-style backoff
scheme or a queue-based algorithm for reducing the cost of spin-waiting. Anderson
et al. [5] argue that the operating system should recognize that a preempted thread
is executing in a critical section, and execute the preempted thread until the thread
exits the critical section. Alemany and Felten [2] consider implementation issues of
non-blocking concurrent objects on shared-memory multiprocessors. They show how
the resources wasted by the non-blocking operations that fail and the cost of data
copying required by a non-blocking implementation can be reduced by relying on
operating system support.

Interruptible Critical Sections is an optimistic synchronization mechanism for
uniprocessors with applicability to embedded and real-time systems. Techniques
using interruptible critical sections improve the schedulability of task sets that have
high priority tasks with tight deadlines. In this chapter, we extend interruptible
critical sections to shared-memory multiprocessors under progressively more complex
environments. Although the methodology for writing interruptible critical sections
remains the same as discussed in Chapter 4, the algorithms differ in the techniques
used to detect critical section access conflicts on a multiprocessor. We discuss
the algorithms as implemented on a multiprocessor system running the pSOS+
real-time operating system. Using our implementation, simulation, and analytical
modeling, we study the parameters under which interruptible critical sections
outperform other lock-based algorithms. We also present a formal proof of
correctness for one of the algorithms.
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In  this  chapter,  we  extend  interruptible  critical  sections  to  multiprocessor  systems. 
The  main  issue  here  is  how  to  detect  the  conflict  among  tasks  spread  across  multiple 
processors  in  using  the  shared  critical  section.  In  the  uniprocessor  case,  we  were 
able  to  detect  the  conflict  during  the  context  switch,  but  this  is  no  longer  possible 
in  a  multiprocessor.  To  resolve  this  issue,  we  let  the  critical  section  be  protected 
by  a  global  lock  mechanism.  Depending  on  how  we  handle  the  lock,  we  present  the 
following  algorithms. 

•  ICSM  with  Lock  Release  (ICSM-R) 

In a multiprogrammed multiprocessor, the task owning the lock may experience
a context-switch. In such an event, tasks on other processors waiting for the
lock are blocked for a time much longer than the critical section execution
time. This bottleneck becomes more pronounced as the number of processors
and the degree of multiprogramming increase.

In the ICSM-R algorithm, the lock is released from the old task during a
context switch. When the task is rescheduled on the processor, the lock is
re-acquired, the test for conflict is performed, and the critical section is
restarted if necessary. This solution avoids the problem of detecting a conflict.

•  ICSM  with  Task  Kill  (ICSM-K) 

A  common  scenario  in  a  multiprocessor  real-time  system  is  a  high  priority 
Server  task  servicing  the  requests  of  several  low  priority  Client  tasks. 

In  this  model,  we  assume  that  there  is  only  one  high  priority  task  among  a 
set  of  tasks  using  the  critical  section.  The  low  priority  tasks  synchronize  the 
usage  of  the  critical  section  with  a  semaphore.  The  high  priority  task  sets  a 
kill  flag  before  it  uses  the  critical  section.  If  there  is  already  another  task  that 
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is  executing  in  the  critical  section,  the  set  state  of  the  kill  flag  is  detected  and 
the  critical  section  is  re-executed. 

This solution detects a conflict by using shared memory and the atomic
Compare&Swap instruction.

•  ICSM  with  Priority  Queue  (ICSM-P) 

Here, we consider a set of tasks with varying priorities that execute on a
multiprocessor. Instead of the spin-lock, we protect the critical section using a
priority queue lock, using the mechanism of the PR-Lock (Chapter 3).

First,  a  task  enqueues  in  the  priority  queue  as  in  the  PR-Lock.  When  a  higher 
priority  task  finds  that  the  critical  section  is  busy  with  a  lower  priority  task,  it 
interrupts  the  lower  priority  task  at  the  head  of  the  queue  and  uses  the  critical 
section.  The  lower  priority  task  detects  the  failure  during  the  release  of  the 
lock,  and  retries  the  lock  acquisition. 

In  the  following  sections,  we  will  discuss  the  implementation,  and  the  performance 
of  each  of  these  algorithms. 

5.2    ICSM  with  Lock  Release  (ICSM-R) 

In  a  multiprogrammed  multiprocessor  system,  existing  lock  based  mechanisms 
hold  the  lock  for  a  task  even  during  a  context-switch.  The  duration  of  the  context- 
switch  depends  on  the  time  quantum  and  the  degree  of  multiprogramming.  This 
duration  can  be  high,  resulting  in  a  wasted  lock  utilization  and  more  importantly,  an 
increase  in  the  average  response  time  for  tasks  using  the  lock. 

In  this  section,  we  study  the  possibility  of  releasing  the  lock  held  by  a  task  during  a 
context-switch  by  proposing  the  ICSM-R  algorithm.  Once  the  lock  is  released  during 
a  context-switch,  it  is  reacquired  when  the  task  executes  in  the  next  quantum.  This 
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results  in  low  lock  utilization,  but  there  is  a  penalty  of  restarting  the  critical  section 
if another task has committed in the critical section in the interim. The critical
section response time will be low if the number of restarts of the critical section is
small. We study the parameters and conditions under which the ICSM-R algorithm
performs  better  than  an  algorithm  that  does  not  release  the  lock  during  a  context- 
switch. 

We  implemented  ICSM-R  with  spin-locks  using  the  Test&Set  instruction.  Two 
data  structures  are  needed  to  implement  ICSM-R:  one  for  the  critical  section,  and 
one  for  each  task  that  uses  the  critical  section. 

The  global  lock  structure  consists  of  a  critical  section  identifier,  a  counter  that 
tracks  the  number  of  times  the  critical  section  has  been  executed,  and  the  critical 
section  bounds.  It  also  contains  the  address  of  the  global  spin-lock  variable. 

struct ICS_Lstruct {
    int   id;        /* ID of this critical section */
    int   cs_count;  /* Global Execution Count */
    char *cstop;     /* CS Top Address including the spin-lock */
    char *cslow;     /* CS Low Address excluding the spin-lock */
    char *cshigh;    /* CS High Address */
    int   gcount;    /* Shared Counter */
    int  *glock;     /* Spin-Lock Address */
};

The structure local to a task consists of a copy of the ICS execution count,
a count of the number of times the critical section is retried on any invocation (for
statistics), a pointer to the ICS_Lstruct, and a flag to indicate that the task is entering
the critical section.


struct ICS_Tstruct {
    int                 process_count;  /* Local Execution Count */
    struct ICS_Lstruct *ilp;            /* Interruptible Lock Record Pointer */
    int                 icount;         /* Interrupt Count of a task */
    int                 flag;           /* Flag = ID => In CS; = 0 => Not */
};


The ICS implementation code consists of two parts: the ICSM-Rctxsw routine,
which provides the ICS lock mechanism, and the ICSM-Rclient task, which uses the
ICS mechanism.

5.2.1    ICSM-Rctxsw  Routine 

The  ICSM-Rctxsw  routine  is  shown  in  Figure  5.1.  The  ICSM-Rctxsw  routine 
is  integrated  with  the  pSOS+  kernel  as  a  user  written  routine  that  is  called  during 
a  context-switch.  The  call  occurs  at  the  point  where  the  context  of  the  switched- 
out  task  has  been  completely  saved,  and  before  the  context  of  the  switched-in  task  is 
loaded.  pSOS+  provides  the  addresses  of  the  Task  Control  Blocks  (TCBs)  of  both  the 
switched-in  task  and  the  switched-out  task  in  machine  registers.  The  TCB  contains 
all  the  context  of  a  task,  including  the  Program  Counter  (PC).  ICSM-Rctxsw  can 
reset  the  PC  in  the  TCB  of  a  switched-in  task,  if  required.  pSOS+  provides  a  set 
of  eight  software-defined  user  registers  that  a  task  can  access  in  the  TCB.  The  user 
register 0, U_REG0, is used to contain the address of the ICS_Tstruct of a task using
ICS.

ICSM-Rctxsw first checks if the program counter (PC) of the old task about to
be switched out is within the critical section region, and if so, it releases the spin-lock
by setting the lock variable to zero. Next, the ICSM-Rctxsw routine checks if the new
task about to be switched in is within the critical section region. If so, it attempts to
reacquire the spin-lock without spinning. If successful, ICSM-Rctxsw checks if there
was a conflicting operation in the interim. If so, it sets the task's PC to re-execute
the critical section. If there is no conflict, the task is allowed to continue from where it was

ICSM-Rctxsw()
{
    struct tcb *in_tcb, *out_tcb;
    struct ICS_Tstruct *tlp;

    load out_tcb from the machine register;
    tlp = Get U_REG0 from out_tcb;
    if (tlp != NULL && tlp->flag == LOCKID)
    {
        /* Switched-out task is inside the CS: release the spin-lock */
        if (out_tcb->currpc >= tlp->ilp->cslow &&
            out_tcb->currpc < tlp->ilp->cshigh)
            *(tlp->ilp->glock) = 0;
    }

    load in_tcb from the machine register;
    tlp = Get U_REG0 from in_tcb;
    if (tlp != NULL && tlp->flag == LOCKID)
    {
        /* Switched-in task is inside the CS: try to re-acquire the lock */
        if (in_tcb->currpc >= tlp->ilp->cslow &&
            in_tcb->currpc < tlp->ilp->cshigh)
        {
            status = Test&Set(tlp->ilp->glock);
            if (status == SUCCESS)
            {
                /* Another task committed in the interim: restart the CS */
                if (tlp->process_count != tlp->ilp->cs_count)
                    in_tcb->currpc = tlp->ilp->cslow;
            }
            else
            {
                /* Lock is busy: go back to the lock acquisition */
                in_tcb->currpc = tlp->ilp->cstop;
            }
        }
    }
}

Figure 5.1. The context switch routine for ICSM-R
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when  it  was  switched  out.  If  the  attempt  to  re-acquire  the  lock  by  the  ICSM-Rctxsw 
routine  is  unsuccessful,  the  task  is  made  to  restart  from  the  point  of  acquiring  the 
lock  for  the  critical  section. 
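The decision logic of the context-switch hook can be exercised outside the kernel. The following is a minimal sketch in C, not the pSOS+ implementation: the addresses `CS_TOP`, `CS_LOW` and `CS_HIGH`, the record types `lock_rec` and `task_rec`, and the non-atomic `try_lock` stand-in for Test&Set are all assumptions made for illustration, and the `flag` check is omitted for brevity.

```c
#include <assert.h>

/* Illustrative (hypothetical) code addresses of the critical section. */
enum { CS_TOP = 100, CS_LOW = 110, CS_HIGH = 150 };

struct lock_rec {
    int cs_count;       /* global execution (commit) count */
    int glock;          /* 0 = free, 1 = held */
};

struct task_rec {
    int process_count;  /* cs_count observed when the task entered the CS */
    int pc;             /* saved program counter */
};

static int in_cs(int pc) { return pc >= CS_LOW && pc < CS_HIGH; }

/* Non-atomic stand-in for Test&Set: returns 1 if the lock was acquired. */
static int try_lock(struct lock_rec *l)
{
    if (l->glock) return 0;
    l->glock = 1;
    return 1;
}

/* Switch-out half: release the lock if the task is inside the CS. */
static void switch_out(const struct task_rec *t, struct lock_rec *l)
{
    assert(t && l);
    if (in_cs(t->pc)) l->glock = 0;
}

/* Switch-in half: returns the PC at which the task resumes. */
static int switch_in(struct task_rec *t, struct lock_rec *l)
{
    assert(t && l);
    if (!in_cs(t->pc)) return t->pc;       /* not in the CS: nothing to do */
    if (!try_lock(l))
        t->pc = CS_TOP;                    /* lock busy: acquire it again */
    else if (t->process_count != l->cs_count)
        t->pc = CS_LOW;                    /* conflict: restart the CS */
    return t->pc;                          /* otherwise resume unchanged */
}
```

Under these assumptions, a task saved at PC 120 whose `process_count` still equals `cs_count` resumes at 120; if `cs_count` changed while it was switched out it resumes at `CS_LOW`; and if another task holds the lock it resumes at `CS_TOP`.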

5.2.2  ICSM-Rclient  Tasks 

We  choose  to  implement  a  shared  global  counter.  On  each  processor,  we  run 
four  tasks.  Among  these  four  tasks,  only  one  task  is  the  ICSM-Rclient  task  that 
increments  the  shared  counter  and  the  rest  are  dummy  tasks.  All  the  four  tasks  are 
started  under  the  control  of  a  low  priority  parent  task.  Message  Queues  of  pSOS+ 
are  used  for  communication  between  the  tasks  and  the  parent.  The  algorithm  for  an 
ICSM-Rclient  task  is  presented  in  Figure  5.2. 
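The client side of the shared counter can be sketched with C11 atomics. This is an illustrative stand-in rather than the pSOS+ task itself: `shared_cs`, `increment_once` and `run_client` are hypothetical names, `atomic_flag` plays the role of the Test&Set spin-lock, and the sketch updates `cs_count` while still holding the lock so that the example is free of data races under the C11 memory model (Figure 5.2 orders these two statements the other way).

```c
#include <assert.h>
#include <stdatomic.h>

struct shared_cs {
    atomic_flag glock;  /* spin-lock; clear = free */
    int cs_count;       /* number of committed executions of the CS */
    int gcount;         /* the shared counter being protected */
};

/* One pass through the critical section. Returns the number of failed
   Test&Set attempts made while spinning for the lock. */
static int increment_once(struct shared_cs *cs, int *process_count)
{
    int spins = 0;
    assert(cs && process_count);
    while (atomic_flag_test_and_set(&cs->glock))
        spins++;                        /* spin until the lock is free */
    *process_count = cs->cs_count;      /* snapshot used for conflict tests */
    cs->gcount++;                       /* the protected update */
    cs->cs_count++;                     /* commit */
    atomic_flag_clear(&cs->glock);      /* release the lock */
    return spins;
}

/* Increment the shared counter until it reaches target; returns the number
   of critical section executions performed by this caller. */
static int run_client(struct shared_cs *cs, int target)
{
    int process_count = 0, passes = 0;
    while (cs->gcount < target) {
        increment_once(cs, &process_count);
        passes++;
    }
    return passes;
}
```

A single caller that starts from `struct shared_cs cs = { ATOMIC_FLAG_INIT, 0, 0 };` and calls `run_client(&cs, 5)` performs five passes and leaves the lock free.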

5.2.3  ICSM-R  Performance  Analysis 

First,  we  analyzed  the  performance  of  ICSM-R  algorithm  as  implemented,  by 
conducting  a  set  of  experiments.  We  compared  the  performance  of  ICSM-R  with  a 
spin-lock  algorithm  that  does  not  release  the  lock  during  a  context-switch  (LOCK- 
NR).  As  we  are  limited  by  the  number  of  processors  available  for  the  implementation, 
and  to  better  understand  the  implications  of  various  parameters,  we  constructed 
an  analytical  model.  We  validated  the  analytical  model  by  using  discrete  event 
simulation.  We  present  the  results  of  the  experiments  and  the  analysis  in  the  following 
sub-sections. 

Experimental  Performance  Results 

We  analyzed  the  performance  of  ICSM-R  algorithm  by  conducting  an  experiment. 
The  goal  of  the  experiment  is  to  increment  a  global  counter  till  a  target  value  is 
reached.  A  spin-lock  critical  section  is  used  to  protect  the  global  counter  from  multiple 


Parent()
{
    Setup the global ICS_Lstruct ilr;
    Startup all the tasks;
    while (all the tasks are not done)
    {
        collect the timing and retries from each run of a task;
    }
    Report the parameters and statistics;
}

ICSM-Rclient()
{
    struct ICS_Tstruct tlr;

    Setup the local ICS_Tstruct structure tlr;
    Setup the U_REG0 to point to local ICS_Tstruct;
    while (!done)
    {
        Work for Tw cycles;
        tlr.flag = LOCKID;
ics_top:
        /* Spin until the lock is acquired */
        do
        {
            status = Test&Set(tlr.ilp->glock);
        } while (status == FAILURE);
ics_start:
        /* Remember the global count seen on entry to the CS */
        tlr.process_count = tlr.ilp->cs_count;
        tlr.ilp->gcount++;
        if (tlr.ilp->gcount == TARGET_COUNT)
            done = TRUE;
        Idle for CS_IDLE cycles of time;
        *(tlr.ilp->glock) = 0;
        tlr.ilp->cs_count++;
ics_end:
        tlr.flag = 0;
        Report the time taken and retry count to parent;
    }
    Clear U_REG0;
    Report done to parent;
}

Figure 5.2. The algorithm for a task using ICSM-R
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updates.  We  used  the  Motorola  68030  multiprocessor  system  with  shared-memory 
running  the  pSOS+  kernel. 

We  compared  the  performance  of  ICSM-R  with  a  spin-lock  algorithm  that  does 
not  release  the  lock  during  a  context-switch  (LOCK-NR).  Each  processor  executes 
four  tasks,  where  each  task  works  for  Tw  units  of  time.  One  task  among  the  four 
tasks  enters  the  critical  section.  It  stays  in  the  critical  section  for  Tc  time  units.  We 
varied the number of processors from one to four. In our experiment, Tw and Tc are
random variables uniformly distributed within 20% of a selected mean. We set the mean
of Tw to 20 milliseconds (ms). The time quantum for a task is 20 ms for processor sharing
among the four tasks. We varied the critical section execution time Tc to reflect
various load conditions with lock utilization of 1.25%, 6% and 10% per processor.

The performance results are shown in Figure 5.3. Experimental results confirm
that as the number of processors increases, ICSM-R performs better than LOCK-NR.
When using a small critical section, the performance improvement is small, as there
is only a small probability that a context-switch occurs during a critical section. As
the critical section size increases, this probability is higher, and releasing the
lock during a context-switch helps tasks running on other processors to acquire the
lock faster, thereby avoiding wasted cycles of spinning. Note that the elapsed time
can increase as the number of processors is increased beyond a certain limit. This
can be attributed to the fact that the cost of synchronization can exceed the speedup
that can be achieved by an increase in the number of processors. The elapsed time
for the LOCK-NR algorithm increases as the number of processors is increased from
3 to 4, as shown in Figure 5.3.

Figure 5.3. ICSM-R Performance Results (Tw = 20 milliseconds)
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Analytical  Performance  of  ICSM-R 

In  this  section,  we  derive  the  performance  of  the  ICSM-R  algorithm  and  the 
LOCK-NR  algorithm  using  an  analytical  model. 

For the analytical model, we consider N processors running M tasks
each. Each task on a processor works for Tw units of time. In addition, one out of
M  tasks  on  a  processor  requests  the  service  of  a  critical  section  that  takes  Tc  time 
units  to  execute.  Each  processor  is  shared  among  M  tasks  using  a  time  quantum 
of  Tq  time  units  in  a  round-robin  fashion.  The  cycle  time  of  a  task  is  composed  of 
the  work  time  and  the  critical  section  time,  if  any,  of  the  task.  We  are  interested  in 
estimating  the  cycle  time  of  the  task  using  the  critical  section. 

We  use  the  following  notation  for  the  analysis: 

N  :  Number  of  processors 

M  :  Number  of  tasks  per  processor 

Tw  :  Work  time  for  each  task 

Tq  :  Time  quantum  for  processor  sharing 

Tc  :  Critical  Section  execution  time 

Rp  :  Critical  section  utilization  per  processor 

R  :  Total  critical  section  utilization 

X  :  The  total  work  time  in  a  cycle 

Z  :  The total time spent in waiting for the critical section in a cycle

B  :  The CPU time spent holding the critical section

E[CxinCs]  :  Expected number of context-switches while in critical section

C_N  :  Cycle time for a task using LOCK-NR

C_I  :  Cycle time for a task using ICSM-R

Figure 5.4. Model of a cycle for a task using LOCK-NR


Analysis  of  algorithm  LOCK-NR 

In  this  model,  the  critical  section  is  not  released  during  a  context-switch  while 
the  critical  section  is  being  executed  by  a  task. 

A  cycle  of  the  task  using  the  critical  section  on  a  processor  is  shown  in  Figure  5.4. 
A  cycle  consists  of  the  work  time  X,  Z  units  of  time  waiting  for  the  critical  section 
and  B  units  of  time  in  the  critical  section. 

We  have, 


\[ C_N = X + Z + B \tag{5.1} \]
\[ X = M \cdot T_w \tag{5.2} \]

B depends on Tc, Tq and the number of context switches that are possible while
holding the critical section.

\[ E[CxinCs] = \frac{T_c}{T_q} \]

\[ B = T_c + E[CxinCs] \cdot (M - 1) \cdot T_q
     = T_c + \frac{T_c}{T_q} \cdot (M - 1) \cdot T_q
     = M \cdot T_c \tag{5.3} \]
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Assuming  that  a  request  for  critical  section  is  uniformly  distributed  in  time,  the 
probability of a conflict in using the critical section is just the fraction of time the
critical section is being used. Considering that the critical section can be modeled as
an M/M/1 queue, the expected blocking time Z is given by

\[ Z = \frac{R_r}{1 - R_r} \cdot B_r \tag{5.4} \]

where R_r is the utilization of the rest of the N - 1 tasks that use the critical
section, given by

\[ R_r = \frac{N - 1}{N} \cdot R \]

and B_r is the residual life [51] of the lock holding time B for which the task
under consideration is blocked. Assuming that the lock holding time is uniformly
distributed with a mean B and variance \sigma_B^2 = \sigma_{T_c}^2, B_r is given by

\[ B_r = \frac{B}{2} + \frac{\sigma_B^2}{2 B} \]
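For reference, the residual-life term follows from the standard renewal-theory result for the mean residual life of a random variable. The short derivation below is a sketch that uses only the mean B and the variance \sigma_B^2 named in the text; on the right-hand side, B denotes the mean lock holding time.

```latex
\begin{align*}
B_r &= \frac{E[B^2]}{2\,E[B]}
     = \frac{B^2 + \sigma_B^2}{2B}
     = \frac{B}{2} + \frac{\sigma_B^2}{2B}.
\end{align*}
```

With zero variance (a deterministic holding time) this reduces to B_r = B/2, that is, a blocked task waits on average for half of the holding time.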

As only one task uses the critical section on a processor, the utilization per
processor is given by

\[ R_p = \frac{B}{X + Z + B} \]

Then, the critical section utilization is

\[ R = N \cdot R_p
     = N \cdot \frac{B}{X + Z + B}
     = N \cdot \frac{B}{X + \frac{R_r}{1 - R_r} B_r + B} \tag{5.5} \]

Figure 5.5. Model of a cycle for a task using ICSM-R

X and B can be computed using equations 5.2 and 5.3, respectively. We can
compute R using equation 5.5 with iteration, setting the initial value of R to be zero.
Knowing R and B, Z can be computed, and hence the cycle time C_N.
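The iteration described above can be sketched in C as follows. This is an illustration under stated assumptions, not the program used for the reported results: the function and field names are hypothetical, R_r is taken to be (N - 1)/N times R, the variance of the holding time is passed in as `var_b`, and a `saturated` flag is raised when R_r reaches 1, where the blocking term has no finite value.

```c
#include <assert.h>

/* Result of the LOCK-NR model: utilization R, blocking time Z, cycle time C. */
struct nr_result {
    double R, Z, C;
    int saturated;  /* set when R_r >= 1 and eq. 5.4 has no finite value */
};

static struct nr_result lock_nr_cycle(int N, int M, double Tw, double Tc,
                                      double var_b, int iters)
{
    assert(N >= 1 && M >= 1 && Tw > 0.0 && Tc > 0.0);
    double X = M * Tw;                          /* eq. 5.2 */
    double B = M * Tc;                          /* eq. 5.3 */
    double Br = B / 2.0 + var_b / (2.0 * B);    /* residual life of B */
    struct nr_result r = { 0.0, 0.0, X + B, 0 };

    for (int i = 0; i < iters; i++) {
        double Rr = (N - 1.0) / N * r.R;        /* assumed form of R_r */
        if (Rr >= 1.0) {                        /* lock saturated */
            r.saturated = 1;
            break;
        }
        r.Z = Rr / (1.0 - Rr) * Br;             /* eq. 5.4 */
        r.R = N * B / (X + r.Z + B);            /* eq. 5.5 */
    }
    r.C = X + r.Z + B;                          /* eq. 5.1 */
    return r;
}
```

With a single processor there is no competing task, so Z stays at zero and the cycle time is simply X + B; with more processors the blocking term grows until the saturation guard triggers.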

Analysis  of  algorithm  ICSM-R 

In this model, the critical section is released during a context-switch while the
critical section is being executed by a task. The critical section is re-acquired and
continued by the task during the next quantum. This acquire/release cycle continues
until the critical section is completed.

A cycle of the task using the critical section on a processor is shown in Figure
5.5. In this case, after the work period X, there is a possible blocking time Z1 spent
waiting for the critical section to be free. Then there is the partial critical section
holding time B1. At this time the task may experience a context switch and another
blocking time represented by Z2. This acquire/release cycle continues
until the critical section is completed. Whenever the critical section is re-acquired,
the task may have to restart from the beginning of the critical section if another task
committed in the critical section between the time the critical section was last
released and the time it was re-acquired.

We have,

\[ C_I = X + Z + B \tag{5.6} \]
\[ X = M \cdot T_w \tag{5.7} \]

where Z = Z1 + Z2 + ... + ZF and B = B1 + B2 + ... + BF. B depends on Tc
and the number of times a task is restarted from the beginning of the critical section
because of a commit by another task.


\[ B = T_c + T_r \cdot N_r \]

where T_r is the partial execution time of the critical section before a task is
restarted because of a conflict, and N_r is the number of times the task is restarted
because of a conflict before it commits. The expected value of T_r is T_c / 2. Given
the probability of restart of a task within the critical section as Pr[CSRestart],

\[ N_r = \sum_{i > 0} i \cdot \Pr[\mathit{CSRestart}]^{i}
       = \frac{\Pr[\mathit{CSRestart}]}{(1 - \Pr[\mathit{CSRestart}])^{2}} \]

Then,

\[ B = T_c + \frac{T_c}{2} \cdot
       \frac{\Pr[\mathit{CSRestart}]}{(1 - \Pr[\mathit{CSRestart}])^{2}} \tag{5.8} \]

where

\[ \Pr[\mathit{CSRestart}] = \Pr[\mathit{CxinCs}] \cdot
   \Pr[\text{a commit in the previous } (M - 1) \cdot T_q \text{ interval}] \]
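The closed form used for the expected number of restarts is the usual sum of an arithmetic-geometric series. A brief derivation, writing p for Pr[CSRestart] and assuming 0 <= p < 1:

```latex
\begin{align*}
N_r = \sum_{i > 0} i\,p^{i}
    = p\,\frac{d}{dp} \sum_{i \ge 0} p^{i}
    = p\,\frac{d}{dp}\left(\frac{1}{1 - p}\right)
    = \frac{p}{(1 - p)^{2}} .
\end{align*}
```

As an illustrative value, p = 0.1 gives N_r = 0.1 / 0.81, or roughly 0.123 restarts per critical section execution.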
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We  can  estimate  that  some  other  task  commits  in  the  previous  interval  of  (M  - 
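1) * Tq by modeling the commits as a Poisson process with an arrival rate

\[ \lambda = \frac{N - 1}{X + B + Z} \]

Then,

\[ \Pr[\text{a commit in the previous } (M - 1) \cdot T_q \text{ interval}]
   = 1 - \Pr[\text{no commit in the previous } (M - 1) \cdot T_q \text{ interval}]
   = 1 - e^{-\lambda (M - 1) T_q} \]

\[ \Pr[\mathit{CxinCs}] =
   \begin{cases}
     T_c / T_q & \text{for } M > 1 \\
     0         & \text{otherwise}
   \end{cases} \]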

The  blocking  time  Z  is  due  to  the  critical  section  being  busy,  and  a  context  switch 
while  in  critical  section. 

\[ Z = \Pr[\mathit{CxinCs}] \cdot (M - 1) \cdot T_q
     + (1 + \Pr[\mathit{CxinCs}]) \cdot \frac{R_r}{1 - R_r} \cdot B_r \tag{5.9} \]

where R_r is the utilization of the rest of the N - 1 tasks that use the critical
section, given by

\[ R_r = \frac{N - 1}{N} \cdot R \]

and B_r is the residual life of the lock holding time B for which the task under
consideration is blocked. Assuming that the lock holding time is uniformly distributed
with a mean B and variance \sigma_B^2 = \sigma_{T_c}^2, B_r is given by

\[ B_r = \frac{B}{2} + \frac{\sigma_B^2}{2 B} \]

As only one task uses the critical section on a processor, the utilization per
processor is given by

\[ R_p = \frac{B}{X + Z + B} \]

Then, the critical section utilization is

\[ R = N \cdot R_p = N \cdot \frac{B}{X + Z + B} \tag{5.10} \]

X can be computed using equation 5.7. We can compute B and R with equations
5.8, 5.9 and 5.10 using iteration, setting the initial values of B to be Tc and of R to
be zero. Knowing R and B, Z can be computed, and hence the cycle time C_I.
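The ICSM-R iteration can be sketched in the same way. Again this is an illustration under stated assumptions rather than the program behind the reported numbers: all names are hypothetical, R_r is taken to be (N - 1)/N times R, Pr[CxinCs] is taken to be Tc/Tq (clamped to 1) for M > 1, and the Poisson term is evaluated by a small helper so that the block has no dependency on the math library (`expm1` from `<math.h>` would be the usual choice).

```c
#include <assert.h>

/* 1 - e^{-x} for finite x >= 0. Large arguments are halved using
   1 - e^{-x} = h(2 - h) with h = 1 - e^{-x/2}; small ones use the series. */
static double one_minus_exp_neg(double x)
{
    if (x > 0.5) {
        double h = one_minus_exp_neg(x / 2.0);
        return h * (2.0 - h);
    }
    double term = x, sum = 0.0;         /* term_n = (-1)^(n+1) x^n / n! */
    for (int n = 1; n <= 20; n++) {
        sum += term;
        term = -term * x / (n + 1);
    }
    return sum;
}

struct icsm_result {
    double B, R, Z, C;
    int saturated;  /* set when a 1/(1 - p) type factor has no finite value */
};

static struct icsm_result icsm_r_cycle(int N, int M, double Tw, double Tc,
                                       double Tq, double var_b, int iters)
{
    assert(N >= 1 && M >= 1 && Tw > 0.0 && Tc > 0.0 && Tq > 0.0);
    double X = M * Tw;                                  /* eq. 5.7 */
    double p_cx = (M > 1) ? Tc / Tq : 0.0;              /* Pr[CxinCs] */
    if (p_cx > 1.0) p_cx = 1.0;                         /* clamp (assumption) */
    struct icsm_result r = { Tc, 0.0, 0.0, 0.0, 0 };    /* B = Tc, R = 0 */

    for (int i = 0; i < iters; i++) {
        double lambda = (N - 1.0) / (X + r.B + r.Z);    /* commit rate */
        double p_commit = one_minus_exp_neg(lambda * (M - 1) * Tq);
        double p = p_cx * p_commit;                     /* Pr[CSRestart] */
        double Rr = (N - 1.0) / N * r.R;                /* assumed form of R_r */
        if (p >= 1.0 || Rr >= 1.0) {
            r.saturated = 1;
            break;
        }
        r.B = Tc + Tc / 2.0 * p / ((1.0 - p) * (1.0 - p));   /* eq. 5.8 */
        double Br = r.B / 2.0 + var_b / (2.0 * r.B);    /* residual life */
        r.Z = p_cx * (M - 1) * Tq
            + (1.0 + p_cx) * Rr / (1.0 - Rr) * Br;      /* eq. 5.9 */
        r.R = N * r.B / (X + r.Z + r.B);                /* eq. 5.10 */
    }
    r.C = X + r.Z + r.B;                                /* eq. 5.6 */
    return r;
}
```

With one processor no other task can commit, so B stays at Tc and Z consists only of the context-switch term Pr[CxinCs] * (M - 1) * Tq.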

Validation  of  Analysis 

We validated the analysis by simulation using SIMPACK [25], a discrete event
simulation package. We set M = 4, Tw = 1000, and Tq = 100. The
work time Tw is a random variable uniformly distributed between 800 and 1200 with
a mean of 1000. The critical section time is also a uniformly distributed random
variable with a range of 20% either way about the mean. In each experiment, we
selected a different Tc from 10 to 90 to represent a wide range of critical section
utilizations. The results comparing the cycle times obtained by simulation and the
cycle times computed by analysis are given in Table 5.1. As can be seen, except
for a high Tc/Tq ratio, the results are accurate enough for the analysis to be
meaningful. The inaccuracy in the analytical model is due to the 1/(1 - R_r) factor
that is present in the computation of Z. This factor goes to infinity as R approaches
100%. A similar conclusion can be drawn for the lock utilization obtained from
simulation and analysis, as presented in Table 5.2.
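The "Abs.% Diff" columns of the two tables can be reproduced with a one-line computation. The sketch below assumes that the difference is taken relative to the simulated value; this choice of denominator is inferred from the table entries rather than stated in the text, and the function name is hypothetical.

```c
#include <assert.h>

/* Absolute percentage difference of an analytical value relative to the
   simulated value (denominator inferred from the tables, see lead-in). */
static double abs_pct_diff(double simulation, double analysis)
{
    assert(simulation != 0.0);
    double d = analysis - simulation;
    if (d < 0.0) d = -d;
    return 100.0 * d / simulation;
}
```

For instance, `abs_pct_diff(4230.69, 4239.95)` evaluates to about 0.22, which agrees with the ICSM-R cycle time entry for Tc = 50 and 8 processors.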

Performance  comparison  using  analysis 

We  analyzed  the  performance  of  ICSM-R  and  LOCK-NR  algorithms  using  the 
model  developed  in  the  previous  sub-sections. 

The cycle times of the tasks using the critical section under various critical section
execution times are shown in Figures 5.6 through 5.10. The work time Tw
is 1000 units, and M is 4. For a critical section to quantum time ratio of up
to 50%, the ICSM-R algorithm performs better than the LOCK-NR algorithm. Even for a
higher ratio, the ICSM-R algorithm performs better for a small number of processors. We
observe that there is a steep transition in the cycle times when the lock utilization
reaches 100% for both algorithms. As expected, the critical section utilization is
lower for ICSM-R, as shown in Figures 5.11 through 5.15.

We analyzed the effect of multiprogramming by varying M from 1 to 16, with Tc
= Tq / 4 and Tc = Tq / 2. Except for the case of no multiprogramming, ICSM-R
always performs better, as shown in Figures 5.16 through 5.25.

We conclude that for a low to moderate ratio of critical section execution time to
quantum time, it is advantageous to use the ICSM-R algorithm. In general, critical
section execution times are very short and only a fraction of the quantum time.
Thus, the ICSM-R algorithm, which releases the lock during a context-switch, is an
attractive alternative to other lock-based algorithms.
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Table  5.1.  Validating  cycle  time  analysis  using  simulation  for  ICSM-R 


                     ICSM-R Cycle Time                    LOCK-NR Cycle Time
Tc   Processors   Simulation   Analysis   Abs.% Diff   Simulation   Analysis   Abs.% Diff
10        4         4044.30     4044.31      0.00        4044.50     4044.39      0.00
10        8         4040.04     4040.01      0.00        4040.64     4040.48      0.00
10       12         4049.11     4049.08      0.00        4050.32     4050.12      0.00
10       16         4055.16     4054.95      0.00        4056.57     4056.44      0.00
10       20         4038.95     4038.11      0.01        4042.03     4041.72      0.02
10       24         4039.85     4038.93      0.02        4042.91     4041.62      0.03
10       28         4049.14     4048.36      0.02        4053.54     4053.23      0.01
10       32         4052.63     4051.64      0.02        4057.69     4057.75      0.00
10       36         4055.59     4054.40      0.03        4062.63     4062.16      0.01
10       40         4046.91     4045.40      0.04        4055.67     4055.13      0.01
10       44         4039.62     4037.89      0.04        4050.62     4049.92      0.02
10       48         4049.14     4046.98      0.05        4062.36     4061.72      0.02
10       52         4048.51     4046.26      0.06        4064.73     4063.90      0.02
10       56         4044.58     4041.87      0.07        4064.06     4062.51      0.04
10       60         4046.44     4043.55      0.07        4068.55     4068.77      0.00
10       64         4049.33     4046.37      0.07        4076.04     4075.88      0.00
50        4         4216.49     4219.43      0.07        4220.63     4220.72      0.00
50        8         4230.69     4239.95      0.22        4245.23     4249.52      0.10
50       12         4254.30     4278.36      0.57        4298.42     4314.26      0.37
50       16         4279.15     4318.48      0.92        4370.40     4428.70      1.33
90        4         4408.98     4425.15      0.34        4413.87       ...        0.20
90        8         4501.65       ...        1.27          ...         ...        1.45
90       12           ...         ...        1.16          ...         ...        6.14
90       16           ...         ...        0.02        5767.69     7134.15      ...
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Table  5.2.  Validating  lock  utilization  analysis  using  simulation  for  ICSM-R 


                   ICSM-R Lock Utilization (%)          LOCK-NR Lock Utilization (%)
Tc   Processors   Simulation   Analysis   Abs.% Diff   Simulation   Analysis   Abs.% Diff
10        4           1.00        0.99       1.00          4.00        3.90       2.50
10        8           2.00        2.02       1.00          7.70        7.90       2.59
10       12           3.00        3.06       0.99         12.00       11.90       0.83
10       16           4.10        4.10       0.00         15.60       15.80       1.28
10       20           5.10        5.18       1.57         19.70       19.81       0.56
10       24           6.20        6.20       0.00         23.00       23.76       3.30
10       28           7.20        7.30       1.39         27.30       27.60       1.10
10       32           8.20        8.30       1.22         30.70       31.50       2.60
10       36           9.20        9.40       2.17         34.10       35.40       3.81
10       40          10.30       10.50       1.94         38.30       39.47       3.05
10       44          11.30       11.52       1.95         43.00       43.34       0.79
10       48          12.30       12.60       2.44         45.40       47.30       4.18
10       52          13.30       13.60       2.26         49.80       51.20       2.81
10       56          14.40       14.71       2.15         52.50       55.18       5.10
10       60          15.40       15.76       2.34         56.50        ...        4.42
10       64          16.40       16.81       ...          59.80       62.90       5.18

50 

4 
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5.01 
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18.70 

18.93 

1.23 
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37.68 
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Figure  5.6.  ICSM-R  Cycle  Times  using  analysis  for  Tc 


[Plot: cycle time (4000 to 10000) versus number of processors (0 to 100) for ICSM-R and LOCK-NR; Tc = 25, Tq = 100]

Figure 5.7. ICSM-R Cycle Times using analysis for Tc = 25
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Figure  5.8.  ICSM-R  Cycle  Times  using  analysis  for  Tc  =  50 


Figure  5.9.  ICSM-R  Cycle  Times  using  analysis  for  Tc  =  75 
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Figure  5.10.  ICSM-R  Cycle  Times  using  analysis  for  Tc  =  90 


Figure  5.11.  ICSM-R  critical  section  utilization  using  analysis  for  Tc  =  10 
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Figure  5.12.  ICSM-R  critical  section  utilization  using  analysis  for  Tc 


Figure  5.13.  ICSM-R  critical  section  utilization  using  analysis  for  Tc 
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Figure  5.14.  ICSM-R  critical  section  utilization  using  analysis  for  Tc 
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Figure  5.15.  ICSM-R  critical  section  utilization  using  analysis  for  Tc 
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Figure  5.16.  ICSM-R  effect  of  multiprogramming  on  cycle  time  for  Tc  =  25  (M  =  1) 
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Figure  5.17.  ICSM-R  effect  of  multiprogramming  on  cycle  time  for  Tc  =  25  (M  =  4) 
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Figure  5.18.  ICSM-R  effect  of  multiprogramming  on  cycle  time  for  Tc  =  25  (M  =  8) 
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Figure  5.19.  ICSM-R  effect  of  multiprogramming  on  cycle  time  for  Tc  =  25  (M  = 
12) 
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Figure  5.20.  ICSM-R  effect  of  multiprogramming  on  cycle  time  for  Tc  =  25  (M  = 
16) 
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Figure  5.21.  ICSM-R  effect  of  multiprogramming  on  cycle  time  for  Tc  =  50  (M  =  1) 
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Figure  5.22.  ICSM-R  effect  of  multiprogramming  on  cycle  time  for  Tc  =  50  (M  =  4) 
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Figure  5.23.  ICSM-R  effect  of  multiprogramming  on  cycle  time  for  Tc  =  50  (M  =  8) 
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Figure  5.24.  ICSM-R  effect  of  multiprogramming  on  cycle  time  for  Tc  =  50  (M  = 
12) 
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Figure  5.25.  ICSM-R  effect  of  multiprogramming  on  cycle  time  for  Tc  =  50  (M 
16) 
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5.3    ICSM  with  Task  Kill  fICSM-K) 

In a multiprocessor system, we often encounter a synchronization environment
in which a high-priority task synchronizes access to a critical section with a set of
low-priority tasks. For example, a set of Client tasks enqueue their I/O requests in
a shared queue that is serviced by a high priority I/O Server task. In a real-time
multiprocessor system, a high priority Monitor task acquires data from an external
source, and the data is in turn analyzed by a set of low-priority Analysis tasks. Both
examples require that the critical section used to synchronize access to the shared
resource have a fast and predictable response time for the high priority task.

The ICSM-K algorithm addresses the issue of synchronization in the above
environment. For this implementation of ICS, we assume that there is only a single high
priority task, with a set of low priority tasks running on multiple processors. The
critical section is implemented using the techniques for writing interruptible critical
sections described in Chapter 4. In addition, we assume the availability of a general
synchronization mechanism for multiprocessors.

The global data structures consist of a lock object L and two semaphores, as
shown in Figure 5.26. The lock L is composed of a pointer to the shared object and
an integer Kill flag. One semaphore (mutex) is used by the set of tasks for exclusive
access to the critical section. The Kill flag is used by the high priority task to gain
access to the critical section if the critical section is busy. The other semaphore
(kills) is used to return the critical section to the low priority task that was killed
by the high priority task. The algorithm for the high priority task is also presented
in Figure 5.26.

The high priority task tests the availability of the mutex semaphore. It enters the
critical section if the mutex semaphore is free. If not, it sets the Kill flag and enters
the critical section anyway. The Kill flag is an integer variable in which the low order
bit is used as the Kill bit and the high order bits are used as a counter for the Kill
bit, which avoids the A-B-A problem when the flag is used with a Compare&Swap
instruction, as described in Chapter 3. In the algorithm shown, we set the Kill bit and
increment the counter with a single statement that increments the Kill flag by a value
of 3. After committing its changes to the shared object, the high priority task leaves
the critical section by resetting the Kill bit and making the kills semaphore available.
We note that the high priority task can use techniques other than ICS for modifying
the shared object, as it is not interrupted in the critical section.
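The Kill flag arithmetic described above can be sketched in C as follows. This is only an illustration under an assumed layout (bit 0 is the Kill bit, the remaining bits are the counter); the helper names are hypothetical and do not appear in the original figures.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout: bit 0 is the Kill bit, bits 1 and up hold the counter. */
#define KILL_BIT 1u

static bool kill_bit_is_set(unsigned flag) { return (flag & KILL_BIT) != 0; }
static unsigned kill_counter(unsigned flag) { return flag >> 1; }

/* High priority task entering a busy critical section:
   +1 sets the Kill bit and +2 advances the counter by one.
   Only meaningful while the Kill bit is clear. */
static unsigned kill_flag_set(unsigned flag)
{
    assert(!kill_bit_is_set(flag));
    return flag + 3u;
}

/* High priority task leaving: clear the Kill bit, keep the counter. */
static unsigned kill_flag_reset(unsigned flag) { return flag & ~KILL_BIT; }
```

Starting from 0, one kill gives the value 3 (Kill bit set, counter 1), and the reset gives 2, so a low priority task that sampled 0 earlier can tell that a kill happened in between.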

The algorithm for the low priority tasks is presented in Figure 5.27. A low priority
task first waits for the mutex semaphore to be free. Then, it repeatedly takes a
snapshot of the lock object L until it observes a snapshot in which the Kill bit is not
set. The low priority task enters the critical section and constructs the changes to the
shared object in the local variable MyObject. At the end of the critical section, the
low priority task prepares to commit its changes to the shared object using the ICS
mechanism. The low priority task constructs a new lock object NewL that points to
its local MyObject. The Kill flag value of NewL is set to the value that was read
before, with the Kill bit reset.

The  low  priority  task  commits  its  changes  by  replacing  the  lock  object  L  with 
NewL,  using  an  atomic  Compare&Swap  instruction.  The  Compare&Swap  instruction 
ensures  that  the  high  priority  task  has  not  entered  the  critical  section  in  the  meantime. 
If  the  Kill  bit  is  set,  the  Compare&Swap  fails  indicating  that  the  critical  section 
execution  was  interrupted  by  the  high  priority  task.  In  this  case,  the  low  priority 
task  waits  for  the  kills  semaphore  to  be  free,  and  the  critical  section  is  restarted  from 
the  beginning.  We  used  the  double  Compare&Swap  (CAS2)  for  testing  the  Kill  flag 
and  committing  the  critical  section  operation  simultaneously. 
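The commit step can be illustrated with a minimal C11 sketch. It is not the implementation used in this work: it assumes the lock object fits in one 64-bit word (an object index in the high half and the Kill flag in the low half) so that a single `atomic_compare_exchange_strong` from `<stdatomic.h>` can stand in for CAS2. All function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical packing: high 32 bits = object index, low 32 bits = Kill flag. */
static _Atomic uint64_t lock_word;

static uint64_t pack(uint32_t obj, uint32_t kill_flag)
{
    return ((uint64_t)obj << 32) | kill_flag;
}
static uint32_t kill_flag_of(uint64_t w) { return (uint32_t)w; }
static uint32_t object_of(uint64_t w) { return (uint32_t)(w >> 32); }

/* Low priority task: take a snapshot of the lock object. */
static uint64_t ics_snapshot(void) { return atomic_load(&lock_word); }

/* Low priority task: the commit succeeds only if the lock word is unchanged
   since the snapshot, i.e. no kill happened in the meantime. */
static bool ics_commit(uint64_t old, uint32_t new_obj)
{
    assert((kill_flag_of(old) & 1u) == 0);   /* snapshot had Kill bit clear */
    uint64_t desired = pack(new_obj, kill_flag_of(old));
    return atomic_compare_exchange_strong(&lock_word, &old, desired);
}

/* High priority task: set the Kill bit and bump the counter (flag += 3).
   Counter overflow into the object bits is ignored in this sketch. */
static void ics_kill(void) { atomic_fetch_add(&lock_word, 3u); }

/* High priority task: clear the Kill bit on exit. */
static void ics_kill_reset(void) { atomic_fetch_and(&lock_word, ~(uint64_t)1); }
```

Because the counter changes on every kill, a snapshot taken before a kill stays invalid even after the Kill bit has been reset, which is the A-B-A protection the text refers to.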

To show that the ICSM-K algorithm is correct, we observe that there can be a
conflict only between the high priority task and a single low priority task, because
the mutex semaphore serializes the low priority tasks. In this case, the ICSM-K
algorithm is decisive-instruction serializable [83]. The tasks using the ICSM-K
algorithm have a single decisive instruction. The decisive instruction for the high
priority task is the setting of the Kill flag. The decisive instruction for the low
priority task is the successful Compare&Swap instruction that commits the lock L.
Corresponding to a concurrent execution C of a set of operations, there is an
equivalent (with respect to return values and final states) serial execution $S_d$ such
that if operation $O_1$ executes its decisive instruction before operation $O_2$ does
in C, then $O_1 < O_2$ in $S_d$.

If  the  high  priority  task  commits  by  setting  the  Kill  flag  before  the  low  priority 
task  executes  its  decisive  Compare&Swap  instruction,  the  Compare&Swap  fails.  In 
this  case,  the  low  priority  task  correctly  waits  for  the  kills  semaphore  and  restarts 
its  critical  section.  If  the  high  priority  task  commits  after  the  low  priority  task 
executes  the  Compare&Swap  instruction,  the  Compare&Swap  for  the  low  priority 
task  succeeds  and  there  is  no  critical  section  conflict. 

5.3.1    Experimental  Performance  Results 

We used the ICSM-K synchronization algorithm to implement a shared queue
on the Motorola 68030 shared-memory multiprocessor running the pSOS+ kernel.
There are three low priority enqueue tasks (of equal priority) and a single high priority
dequeue task. Each task runs on an individual processor. We studied the performance
of the ICSM-K algorithm under a moderate lock utilization as well as a high lock
utilization. We compared the performance of the ICSM-K algorithm with that of an
algorithm that does not use the Kill flag for the high priority task (LOCK-NK).

struct Lock {
    struct Object *Ptr;
    int Kill_Flag;
};

struct Lock L = {NULL, 0};
semaphore mutex = 1;
semaphore kills = 0;

ICSM_K_Task_H()
{
    status = sm_p(mutex, No_Wait);
    if (status == SUCCESS)
    {
        Use Critical Section;
        sm_v(mutex);
    }
    else
    {
        L.Kill_Flag += 3;                /* Set kill and increment counter */
        Use Critical Section;
        L.Kill_Flag = L.Kill_Flag & ~1;  /* Reset kill */
        sm_v(kills);
    }
}

Figure 5.26. The global data structures and the ICSM-K algorithm for the high
priority task


ICSM_K_Task_L(struct Object *MyObject)
{
    struct Lock OldL, NewL;

    status = sm_p(mutex, Wait);
    do  /* cs_restart */
    {
        do
        {
            OldL = L;
            if ((OldL.Kill_Flag & 1) == TRUE)
                sm_p(kills, Wait);
        } while ((OldL.Kill_Flag & 1) == TRUE);
        Use Critical Section;
        NewL.Ptr = MyObject;
        NewL.Kill_Flag = OldL.Kill_Flag;
        status = cas2(&L, OldL, NewL);   /* cs commit */
    } while (status == FAILURE);
    sm_v(mutex);
}

Figure 5.27. The ICSM-K algorithm for the low priority task

In the first experiment, each task performs 10,000 enqueue (dequeue) operations.
Each enqueue task works (idles) for 140 ms (milliseconds) and enters a 10 ms critical
section, giving a combined lock utilization of 20%. The dequeue task sleeps for 60 ms
before entering a 10 ms critical section. We collected the time to execute the critical
section, and we created histograms that show how frequently the critical section
execution takes a particular amount of time.

The performance results of the LOCK-NK algorithm and the ICSM-K algorithm
for 20% lock utilization are shown in Figure 5.28 and Figure 5.29, respectively. With
the LOCK-NK algorithm, the dequeue operation sometimes experiences a delay. With
the ICSM-K algorithm, the high priority dequeue operation is always performed
without delay, while the enqueue operations experience only a small degradation.

In  the  next  experiment,  the  work  time  of  the  enqueue  tasks  is  set  to  100  ms  while 
the  critical  section  time  is  extended  to  50  ms  to  reflect  high  lock  utilization  of  100%. 
The  parameters  for  the  dequeue  task  remain  the  same  as  before.  Figure  5.30  and 
Figure  5.31  show  the  time  to  enqueue  and  dequeue  using  LOCK-NK  and  ICSM-K 
algorithms.  The  delay  in  the  dequeue  operation  is  eliminated  by  ICSM-K,  but  with 
a  significant  increase  in  the  response  time  of  enqueue  operations.  This  is  because  of 
the  100%  lock  utilization  by  the  enqueue  tasks. 
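The two utilization figures quoted for the experiments are consistent with the following arithmetic, where the symbols $n$ (number of enqueue tasks), $T_{work}$ and $T_{cs}$ are notation introduced here for illustration and are not used elsewhere in the text:

```latex
U = n \cdot \frac{T_{cs}}{T_{work} + T_{cs}}, \qquad
U_{1} = 3 \cdot \frac{10}{140 + 10} = 0.20, \qquad
U_{2} = 3 \cdot \frac{50}{100 + 50} = 1.00
```

Under this reading, the second experiment leaves no idle lock time at all for the enqueue tasks, so every critical section taken by the dequeue task pushes enqueue work back.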

A frequently encountered situation in a multiprocessor environment is a high
priority task sharing a resource with a set of low priority tasks. Existing synchronization
mechanisms block the high priority task from executing the critical section when a
low priority task is in the critical section. The critical section execution time for
the high priority task varies depending on the resource utilization of the low priority
tasks, resulting in an unpredictable response time for the high priority task. When
the ICSM-K algorithm is used for sharing a resource, priority inversion is eliminated
and the critical section response time for the high priority task is predictable
irrespective of the utilization by the low priority tasks. This feature makes the
ICSM-K algorithm suitable for a real-time multiprocessor system.

Figure 5.28. Response time distribution of LOCK-NK (20% Lock Utilization)

Figure 5.29. Response time distribution of ICSM-K (20% Lock Utilization)

Figure 5.30. Response time distribution of LOCK-NK (100% Lock Utilization)

Figure 5.31. Response time distribution of ICSM-K (100% Lock Utilization)

5.4 ICSM with Priority Queue (ICSM-P)

Here,  we  consider  a  general  shared-memory  multiprocessor  real-time  system  with 
tasks  of  varying  priorities  accessing  a  shared  critical  section. 

In  the  ICSM-P  algorithm,  we  will  allow  a  task  with  a  higher  priority  to  kill  a  task 
with  lower  priority  that  is  using  the  lock.  The  killed  lower  priority  task  has  to  restart 
its  critical  section  execution  using  the  ICS  mechanism.  We  will  use  the  PR-Lock 
algorithm  described  in  Chapter  3  to  implement  the  ICSM-P  algorithm. 

The  PR-Lock  uses  a  spin-lock  for  synchronization.  Tasks  wait  for  the  lock  to  be 
released  in  a  queue.  The  queue  is  ordered  according  to  the  priorities  of  the  waiting 
tasks.  The  task  that  is  currently  using  the  lock  is  at  the  head  of  the  queue  and  may 
have  a  lower  priority  than  the  other  waiting  tasks. 

A waiting higher priority task can be blocked by at least a critical section execution
time using the PR-Lock. This priority inversion can be avoided by using the ICS
mechanism. We modify the PR-Lock such that if the first waiting task ph has a
higher priority than the task pc currently using the critical section, then ph interrupts
the critical section execution of pc and enters the critical section. Task pc detects
this interruption during its commit-and-release phase and restarts by acquiring the
lock again. Thus, each process uses the acquire_lock and commit_release_lock
operations to synchronize access to the critical section in the following way.

do {
    acquire_lock(L, r)
    critical section
    status = commit_release_lock(L, r, Commit_Parameters)
} while (status == FAILURE)

The following sub-sections present the required modifications to the PR-Lock
algorithm and the acquire_lock and commit_release_lock procedures.

5.4.1 Implementation

We modify the PR-Lock algorithm to implement the ICSM-P algorithm as follows.
We add an Apriority field to the task lock record structure that contains the
task's actual priority. The Priority field in the lock record structure is modified
by the PR-Lock algorithm. We use Compare&Swap2, another form of double-word
Compare&Swap in which the addresses of the two words are not consecutive. The
semantics of the Compare&Swap2 used is given in Figure 5.32. The Dq bit, the Dq
count to avoid the A-B-A problem, and the next record pointer can be implemented
in a single word of 32 bits by using an index for the next record address instead
of a pointer. If too many high priority tasks are allowed to acquire a lock without
blocking, low priority tasks might experience an excessive number of restarts,
increasing the response time and decreasing the schedulability of the task set (Chapter 4,
Section 4.5). In order to limit re-executions, we introduce a minimum kill priority
Pr_KILL. A task pi with priority Pr(pi) can interrupt a task pj with priority Pr(pj)
only if Pr(pi) >= Pr_KILL and Pr(pi) > Pr(pj).
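The kill condition can be written as a small predicate. The sketch below assumes that larger numbers mean higher priority and uses `>=` against the minimum kill priority, following the test on KILL_PRIORITY in Figure 5.34; the function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: may a task with priority pr_i interrupt a lock holder
   with priority pr_j, given the minimum kill priority pr_kill?
   Assumes larger values denote higher priority. */
static bool can_kill(int pr_i, int pr_j, int pr_kill)
{
    return pr_i >= pr_kill && pr_i > pr_j;
}
```

With a minimum kill priority of 8, for instance, a task of priority 9 may interrupt a holder of priority 3, while a task of priority 7 may not, even though it also outranks the holder.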

Each process keeps the address of its record in a local variable (Self). In addition,
each process requires two local pointer variables to hold the previous and the next
queue element for navigating the queue during the enqueue operation (Prev_Node
and Next_Node).

The  data  structures  used  are  shown  in  Figure  5.33.  The  Dq  bit  of  the  Pointer 
field  is  initialized  to  TRUE,  and  the  Ctr  field  is  initialized  to  0  before  the  record  is 
first  used. 
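The remark that the Dq bit, its counter and the next record reference can share one 32-bit word can be made concrete with a small C sketch. The bit split used here (1 Dq bit, 15 counter bits, 16 index bits) and the NIL_INDEX sentinel are assumptions chosen for illustration; the text does not specify them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed split of the 32-bit Next word: 1 Dq bit | 15 counter bits | 16 index bits. */
#define IDX_BITS  16u
#define CTR_BITS  15u
#define IDX_MASK  ((1u << IDX_BITS) - 1u)
#define CTR_MASK  ((1u << CTR_BITS) - 1u)
#define NIL_INDEX IDX_MASK            /* hypothetical "no next record" index */

static uint32_t next_pack(bool dq, uint32_t ctr, uint32_t idx)
{
    assert(idx <= IDX_MASK);
    return ((uint32_t)dq << 31) | ((ctr & CTR_MASK) << IDX_BITS) | idx;
}

static bool     next_dq(uint32_t w)  { return (w >> 31) != 0; }
static uint32_t next_ctr(uint32_t w) { return (w >> IDX_BITS) & CTR_MASK; }
static uint32_t next_idx(uint32_t w) { return w & IDX_MASK; }

/* Value of a record's Next word before first use: Dq = TRUE, Ctr = 0. */
static uint32_t next_initial(void) { return next_pack(true, 0u, NIL_INDEX); }
```

Keeping all three fields in one word is what allows a single-word Compare&Swap to test the Dq bit, the counter and the successor at the same time.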

Acquire_Lock  Operation 

The modifications to the PR-Lock acquire_lock operation are marked with M in
Figure 5.34. A process ph enqueues its lock record as in the PR-Lock algorithm.

Procedure CAS2(structure pointer *C1, *C2, *O1, *O2, *N1, *N2)
/* Assume CAS operates on two different words */
atomic {
    if (*C1 == *O1 && *C2 == *O2) {
        *C1 = *N1; *C2 = *N2;
        return(TRUE);
    }
    else {
        *O1 = *C1; *O2 = *C2;
        return(FALSE);
    }
}

Figure 5.32. CAS2 used in the ICSM-P Algorithm
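The semantics of Figure 5.32 can be mimicked in portable C for experimentation. The sketch below is not a hardware CAS2: it assumes that a single global spin flag (from C11 `<stdatomic.h>`) serializes all callers, so it is atomic only with respect to other calls of the same function. The name `cas2` and the use of `uintptr_t` words are choices made for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* One global spin flag stands in for the hardware atomicity (sketch assumption). */
static atomic_flag cas2_guard = ATOMIC_FLAG_INIT;

/* Compare two independent words against expected values; on a match store the
   new values, otherwise report the current values back through o1 and o2. */
static bool cas2(uintptr_t *c1, uintptr_t *c2,
                 uintptr_t *o1, uintptr_t *o2,
                 const uintptr_t *n1, const uintptr_t *n2)
{
    bool ok;
    while (atomic_flag_test_and_set(&cas2_guard))
        ;                               /* spin until the guard is free */
    if (*c1 == *o1 && *c2 == *o2) {
        *c1 = *n1;
        *c2 = *n2;
        ok = true;
    } else {
        *o1 = *c1;
        *o2 = *c2;
        ok = false;
    }
    atomic_flag_clear(&cas2_guard);
    return ok;
}
```

As in the figure, a failed call refreshes the caller's expected values, so a retry loop does not need a separate reload step.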

Before spinning on the lock variable, ph checks whether it is eligible to kill any other
process and whether the process pc in front of it has a lower priority. If the condition
for interruption is satisfied, then process ph attempts to acquire the lock for itself
by using Compare&Swap2 to make lock L point to its own record. Note that for
Pr(pc) < Pr(ph), process pc must be the current lock user.

The conditions for a successful Compare&Swap2 are as follows. If the lock L
is pointing to the record qc of process pc and qc (with the Dq bit not set) points to
record qh of process ph, then make the lock L point to qh and set the Dq bit in qc.
This is illustrated in Figure 5.35. If the Compare&Swap2 is successful, then process
ph proceeds to use the critical section.

structure Pointer {
    structure Object *Ptr;
    int31 Ctr;
    boolean Dq;
}

structure Record {
    structure structure_of_data Data;
    boolean Locked;
    integer Apriority;
    integer Priority;
    structure Pointer Next;
}

Shared Variable
    structure Pointer L;

Private Variables
    structure Pointer Self, Prev_Node, Next_Node;
    boolean Success, Failure;
    constant TRUE, FALSE, NULL, MAX_PRIORITY;
    constant KILL_PRIORITY;

Record Structure: Data | Locked | Apriority | Priority | Next.Ptr | Next.Ctr | Next.Dq
Pointer P: P.Ptr | P.Ctr | P.Dq

Figure 5.33. Data Structures used in the ICSM-P Algorithm

There are two cases for an unsuccessful attempt to interrupt pc.

1. Process pc may commit and release the lock by setting the Dq bit in qc and
making the lock L point to qh.

2. A concurrent acquire_lock operation of another process pg with Pr(pg) >
Pr(ph) may overtake the Compare&Swap2 instruction of process ph. Then,
qc points to record qg of process pg and no longer points to record qh.

In  either  case,  the  Compare&Swap2  instruction  of  process  ph  fails  and  process  ph 
spins  for  the  lock  as  in  the  PR-Lock.  In  the  first  case,  process  ph  is  about  to  acquire 
the  lock  anyway.  In  the  second  case,  process  ph  is  no  longer  the  highest  priority 
process  among  waiting  processes  to  interrupt  process  pc. 

Commit_Release_Lock Operation

We use the Compare&Swap instruction to set the Dq bit for release because of
the nature of the acquire_lock operation and because the Dq bit is part of the next
record pointer.

If the Compare&Swap is successful, then the commit_release_lock operation
proceeds to the commit phase. Otherwise, a higher priority process ph has interrupted
the critical section execution before the commit, and the commit_release_lock
operation returns a failure status. The failed process requeues by calling the acquire_lock
operation again.

We tie the commit phase of the critical section to the release_lock operation of
the PR-Lock, i.e., the commit pointer is updated only after the Dq bit is set with a
successful Compare&Swap. In this case, the commit_release_lock operation returns a
success status. The modified commit_release_lock operation is shown in Figure 5.36.
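The rule that the commit happens only after the Dq bit has been won can be illustrated with a minimal C11 sketch. It simplifies the real algorithm considerably: the record holds only a packed Next word with an assumed Dq bit position, the kill is simulated by another caller setting the same bit, and handing the lock to the successor is omitted. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DQ_BIT 0x80000000u     /* assumed position of the Dq bit in the Next word */

typedef struct {
    _Atomic uint32_t next;     /* Dq bit plus successor index (hypothetical packing) */
    int committed;             /* stands in for the critical section's commit */
} record_t;

/* Try to flip Dq from clear to set. A failed CAS reloads `old`, so the loop
   retries when only the successor changed and gives up once Dq is set. */
static bool try_set_dq(record_t *r)
{
    uint32_t old = atomic_load(&r->next);
    while ((old & DQ_BIT) == 0) {
        if (atomic_compare_exchange_weak(&r->next, &old, old | DQ_BIT))
            return true;
    }
    return false;
}

/* Lock holder: commit only after winning the Dq bit. */
static bool commit_release(record_t *self, int value)
{
    if (!try_set_dq(self))
        return false;          /* killed: the caller must reacquire and redo */
    self->committed = value;
    return true;
}
```

Whichever party sets the Dq bit first decides the outcome: if the holder wins, its update is committed; if a killer wins, the holder's work is discarded and it must call the acquire operation again.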

5.4.2    Correctness  of  the  ICSM-P  Algorithm 

In this section, we present an informal argument for the correctness properties of
our ICSM-P algorithm. We show that the ICSM-P algorithm is correct by showing
that it maintains a priority queue, and that the head of the priority queue is the process
that holds the lock. For the correctness argument, we consider the acquire_lock
algorithm to consist of two operations: the acquire_lock operation as in the
PR-Lock algorithm, and the kill_lock operation in which a higher priority process
attempts to preempt a lower priority process that is using the lock. The ICSM-P
algorithm is decisive-instruction serializable [83]. All operations of the ICSM-P
algorithm have a single decisive instruction. The decisive instruction for the
acquire_lock operation is the successful Compare&Swap. The decisive instruction for
the release_lock operation is the successful Compare&Swap instruction that sets
the Dq bit. The decisive instruction for the kill_lock operation is the successful
Compare&Swap2 instruction that sets the lock L and the Dq bit simultaneously.
Corresponding to a concurrent execution C of the queue operations, there is an
equivalent (with respect to return values and final states) serial execution $S_d$ such that
if operation $O_1$ executes its decisive instruction before operation $O_2$ does in C, then
$O_1 < O_2$ in $S_d$. Thus, the equivalent priority queue of an ICSM-P lock is in a single
state at any instant, simplifying the correctness proof (a concurrent data structure
that is linearizable but not decisive-instruction serializable might be in several states
simultaneously [37]).

Procedure acquire_lock(L, Self) {
  Success=FALSE;
  do {
    Prev_Node=NULL; Next_Node=NULL;
    if(CAS(&L, &Next_Node, &Self)) {                 /* No Lock */
      Success=TRUE; Failure=FALSE;                   /* Use Lock */
      Self.Ptr->Priority = MAX_PRIORITY;
      Self.Ptr->Next.Dq=FALSE;
    }
    else {                                           /* Lock in Use */
      Failure=FALSE; Self.Ptr->Locked=TRUE;
      do {
        Prev_Node=Next_Node;
        Next_Node=Prev_Node.Ptr->Next;
        if((Next_Node.Dq==TRUE)                      /* Deque, Try Again */   ii
           or (Prev_Node.Ptr->Priority<Self.Ptr->Priority))                   iii
          Failure=TRUE;
        else {
          if (Next_Node.Ptr==NULL or (Next_Node.Ptr!=NULL and
              Next_Node.Ptr->Priority<Self.Ptr->Priority)) {
            Self.Ptr->Next.Ptr=Next_Node.Ptr;
            if(CAS(&(Prev_Node.Ptr->Next), &Next_Node, &Self)) {              F
              Self.Ptr->Next.Dq=FALSE;
              if (Self.Ptr->Apriority >= KILL_PRIORITY and                    M
                  Self.Ptr->Apriority > Prev_Node.Ptr->Apriority) {           M
                Next_Node.Ptr = Self.Ptr;                                     M
                Temp_Node = Next_Node; Temp_Node.Dq = TRUE;                   M
                if(CAS2(&L, &(Prev_Node.Ptr->Next), &Prev_Node,               M
                        &Next_Node, &Self, &Temp_Node))                       M
                  Self.Ptr->Locked = FALSE;                                   M
              }                                                               M
              while(Self.Ptr->Locked);               /* Busy Wait */
              Success=TRUE;                          /* Then, use lock */
            }
            else {
              if((Next_Node.Dq==TRUE)                /* Deque, Try Again */   ii
                 or (Prev_Node.Ptr->Priority < Self.Ptr->Priority))           iii
                Failure=TRUE;
              else
                Next_Node=Prev_Node;                                          i
            }
          }
        }
      } while(!Success and !Failure);
    }
  } while(!Success);
}

Figure 5.34. The acquire_lock operation procedure for ICSM-P

[Diagram: three panels (Before ICSM-P, After Enqueue, After Kill) showing records
ql, qn and qh with the Prev, Next and Temp pointers]

Figure 5.35. A successful acquire_lock operation for ICSM-P

Procedure commit_release_lock(L, Self, Commit_Parameters) {
  Status = FAILURE;
  Old_Value = Self.Ptr->Next;
  while (Self.Ptr->Next.Dq == FALSE)
  {
    New_Value = Old_Value; New_Value.Dq = TRUE;
    if(CAS(&Self.Ptr->Next, &Old_Value, &New_Value))
    {
      Commit the critical section operation;
      L = Self.Ptr->Next;
      if (Self.Ptr->Next.Ptr != NULL) {
        L.Ptr->Priority=MAX_PRIORITY;
        L.Ptr->Locked=FALSE;
      }
      Status = SUCCESS;
    }
  }
  return(Status);
}

Figure 5.36. The commit_release_lock operation procedure for ICSM-P

We use the following notation in our discussion. ICSM-P lock $\mathcal{L}$ has lock pointer
L, which points to the first record in the lock queue (and the record of the process that
holds the lock). Let there be N processes $p_1, p_2, \ldots, p_N$ that participate in the lock
synchronization for a priority lock $\mathcal{L}$, using the ICSM-P algorithm. As mentioned
earlier, each process $p_i$ allocates a record to enqueue and dequeue. Thus, each
process $p_i$ participating in the lock access is associated with a queue record $q_i$. Let
$Pr(p_i)$ be a function which maps a process to its priority, a number between 1 and N.
We also define another function $Pr(q_i)$ which maps a record belonging to a process
$p_i$ to its priority.

A priority queue is an abstract data type that consists of

• A finite set Q of elements $q_i$, $i = 1 \ldots N$

• A function $Pr : q_i \mapsto n_i$, where $n_i \in \mathbb{N}$. For simplicity, we assume that every $n_i$
is unique. This assumption is not required for correctness, and in fact processes
of the same priority will obtain the lock in FCFS order.

• Three operations enqueue, dequeue, and kill.

At any instant, the state of the queue can be defined as

$$Q = (q_0, q_1, q_2, \ldots, q_n)$$

where $q_1 <_Q q_2 <_Q \ldots <_Q q_n$, and $q_i <_Q q_j$ iff $Pr(q_i) > Pr(q_j)$.

We call $q_0$ the head record of priority queue Q. The head record's process is the
current lock holder. Note that the non-head records are totally ordered.

The enqueue operation is defined as

$$\mathrm{enqueue}((q_0, q_1, q_2, \ldots, q_n), q) \rightarrow (q_0, q_1, q_2, \ldots, q_i, q, q_{i+1}, \ldots, q_n)$$

where $Pr(q_i) > Pr(q) > Pr(q_{i+1})$.

The dequeue operation on a queue is defined as

$$\mathrm{dequeue}((q_0, q_1, q_2, \ldots, q_n), q_i) \rightarrow
\begin{cases}
(q_1, q_2, \ldots, q_n)\ \langle \mathrm{SUCCESS} \rangle & \text{iff } i = 0 \\
(q_0, q_1, q_2, \ldots, q_n)\ \langle \mathrm{FAILURE} \rangle & \text{otherwise}
\end{cases}$$

where SUCCESS and FAILURE are return values.

The kill operation on a queue is defined as

$$\mathrm{kill}((q_0, q_1, q_2, \ldots, q_n), q_i) \rightarrow
\begin{cases}
(q_1, q_2, \ldots, q_n)\ \langle \mathrm{SUCCESS} \rangle & \text{iff } i = 1 \wedge Pr(q_1) > Pr(q_0) \\
(q_0, q_1, q_2, \ldots, q_n)\ \langle \mathrm{FAILURE} \rangle & \text{otherwise}
\end{cases}$$
For every ICSM-P lock $\mathcal{L}$, there is an abstract priority queue Q. Initially, both
$\mathcal{L}$ and Q are empty. When a process p with a record q performs the decisive
instruction for the acquire_lock operation, Q changes state to enqueue(Q, q). Similarly,
when a process executes the decisive instruction for a release_lock operation, Q
changes state to dequeue(Q, q).

We show that when we observe $\mathcal{L}$, we find a structure that is equivalent to Q.
To observe $\mathcal{L}$, we take a consistent snapshot [15] of the current state of the system
memory. Next, we start at the lock pointer L and observe the records following
the linked list. If the head record has its Dq bit set and its process has exited the
acquire_lock operation, then we discard it from our observation. If we observe the
same records in the same sequence in both $\mathcal{L}$ and Q, then we say that $\mathcal{L}$ and Q are
equivalent, and we write $\mathcal{L} \Leftrightarrow Q$.

Theorem 1 The representative priority queue Q is equivalent to the observed queue
of the PR-Lock $\mathcal{L}$.

Proof. We prove the theorem by induction on the decisive instructions, using the
following three lemmas.


Lemma 1 If $Q \Leftrightarrow \mathcal{L}$ before a release_lock decisive instruction, then $Q \Leftrightarrow \mathcal{L}$ after
the release_lock decisive instruction.

Figure 5.37. Observed queue $\mathcal{L}$ before and after a release_lock

Proof. Let $Q = (q_0, q_1, q_2, \ldots, q_n)$ before a release_lock decisive instruction. A
release_lock operation is equivalent to a dequeue operation on the abstract queue.
By definition,

$$\mathrm{dequeue}((q_0, q_1, q_2, \ldots, q_n), q_0) \rightarrow (q_1, q_2, \ldots, q_n)\ \langle \mathrm{SUCCESS} \rangle$$

Let L point to the record $q_0$ of process $p_0$. If there is no other concurrent operation,
the release_lock Compare&Swap decisive instruction of $p_0$ changes the Dq bit in $q_0$
from FALSE to TRUE, removing $q_0$ from the observable queue. The before and after
states of $\mathcal{L}$ are shown in Figure 5.37. In this case, the release_lock operation returns
a successful status. Thus, $Q \Leftrightarrow \mathcal{L}$ after the release_lock operation. Note that L will
point to $q_1$ before the next release_lock decisive instruction.

A concurrent kill_lock operation K of process $p_1$ may overtake the release_lock
operation and set the Dq bit in $q_0$ to TRUE. Thus, $Q = (q_1, q_2, \ldots, q_n)$ before the
release_lock operation. By definition,

$$\mathrm{dequeue}((q_1, q_2, \ldots, q_n), q_0) \rightarrow (q_1, q_2, \ldots, q_n)\ \langle \mathrm{FAILURE} \rangle$$

In this case, the Compare&Swap instruction of the release_lock operation fails and
the release_lock operation returns a failed status. Therefore, $Q \Leftrightarrow \mathcal{L}$ after the
release_lock operation. □

Lemma 2 If $Q \Leftrightarrow \mathcal{L}$ before a kill_lock decisive instruction, then $Q \Leftrightarrow \mathcal{L}$ after the
kill_lock decisive instruction.

Proof.  Let  Q  =  (90,91,92,  •  •  •  ,  9n)  before  a  killJock  decisive  instruction.  A 
killJock  operation  is  equivalent  to  a  kill  operation  on  the  abstract  queue.  By  defi- 
nition, 

kiU((q0,q1,q2,...,qn),qi)      (go,  9i,  92,  •  •  • ,  qn)  <  FAILURE  > 

for  i  7^  1.  The  killJock  operation  of  pi  uses  the  PrevJMode  pointer  at  which  the 
calling  process  has  enqueued  in  the  Compare&Swap2  instruction.  For  the  Com- 
pare&Swap2  instruction  to  be  successful,  the  lock  L  should  point  to  the  record  qj 
in  the  PrevJMode  and  the  next  pointer  in  the  record  qj  must  point  to  the  record 
9i.  This  is  true  only  if  j  =  0  and  i  =  1.  As  the  Compare&Swap2  instruction  is 
atomic,  this  condition  is  not  affected  by  any  other  concurrent  operation.  If  i  ^  1, 
Compare&Swap2  fails  and  is  regarded  as  the  failure  of  the  killJock  operation.  Thus, 
Q  ^  Cm  this  case. 

Next,  we  consider  the  case  where  i  =  1.  By  definition, 

H/((9o,9i,92,...,9„),9i)  -+  (gi,g2,...,g„)  <  SUCCESS  > 

Let  L  point  to  the  record  90  of  process  po.  If  there  is  no  other  concurrent  operation, 
the  killJock  Compare&Swap2  decisive  instruction  of  pi  changes  the  lock  L  to  point 
to  91  and  the  Dq  bit  in  90  from  FALSE  to  TRUE,  removing  q0  from  the  observable 
queue.  The  before  and  after  states  of  C  are  shown  in  Figure  5.38.  In  this  case,  the 
killJock  operation  returns  a  successful  status.  Thus,  Q  C  after  the  killJock 
operation. 

Figure 5.38. Observed queue $\mathcal{L}$ before and after a kill_lock

A concurrent release_lock operation R of process $p_0$ may overtake the kill_lock
operation and set the Dq bit in $q_0$ to TRUE. Thus, $Q = (q_1, q_2, \ldots, q_n)$ before
the kill_lock operation. By definition,

$kill((q_1, q_2, \ldots, q_n), q_1) \rightarrow (q_1, q_2, \ldots, q_n)$ <FAILED>

There are two cases to consider. In the first case, process $p_0$ is successful in setting
the Dq bit in $q_0$ to TRUE. In the second case, in addition to the above, process $p_0$
changes the lock L to point to $q_1$. In both cases, the Compare&Swap2 instruction of
the kill_lock operation fails and returns a failed status. Therefore,
$Q \Leftrightarrow \mathcal{L}$ after the kill_lock operation.

A concurrent acquire_lock operation A of process p may overtake the kill_lock
operation and enqueue q between $q_0$ and $q_1$. Thus,
$Q = (q_0, q, q_1, q_2, \ldots, q_n)$ before the kill_lock operation. By definition,

$kill((q_0, q, q_1, q_2, \ldots, q_n), q_1) \rightarrow (q_0, q, q_1, q_2, \ldots, q_n)$ <FAILED>

In this case, the Compare&Swap2 instruction of the kill_lock operation fails because,
even though the lock L is pointing to $q_0$, $q_0$ is no longer pointing to $q_1$.
Therefore, $Q \Leftrightarrow \mathcal{L}$ after the kill_lock operation. □
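
The three kill cases above can be replayed on the abstract queue alone. The following is a minimal Python sketch, not the kill_lock implementation: the function name `kill`, the list-of-strings representation of Q, and the single status string "FAILURE" (the text uses both FAILURE and FAILED) are illustrative assumptions.

```python
from typing import List, Tuple


def kill(queue: List[str], killer: str) -> Tuple[List[str], str]:
    """Abstract kill: succeeds only if `killer` sits directly behind the head q0."""
    if len(queue) >= 2 and queue[1] == killer:
        return queue[1:], "SUCCESS"   # q0 is removed, the killer becomes the head
    return list(queue), "FAILURE"     # the queue is left unchanged


if __name__ == "__main__":
    print(kill(["q0", "q1", "q2"], "q1"))       # no interference
    print(kill(["q1", "q2"], "q1"))             # a release by p0 got in first
    print(kill(["q0", "q", "q1", "q2"], "q1"))  # an acquire slipped q in ahead of q1
```

Only the first call changes the queue, which mirrors the condition $j = 0$ and $i = 1$ required for the Compare&Swap2 instruction to succeed.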

Lemma 3 If $Q \Leftrightarrow \mathcal{L}$ before an acquire_lock decisive instruction,
then $Q \Leftrightarrow \mathcal{L}$ after the acquire_lock decisive instruction.


Proof. There are two different cases to consider:

Figure 5.39. Observed queue $\mathcal{L}$ before and after an acquire_lock to an empty queue

Figure 5.40. Observed queue $\mathcal{L}$ before and after an acquire_lock to a non-empty
queue

Case 1: $Q = ()$ before the acquire_lock decisive instruction. The equivalent
operation on the abstract queue Q is the enqueue operation. Thus,

$enqueue((), q) \rightarrow (q)$

If the lock $\mathcal{L}$ is empty, q's process executes a successful decisive
Compare&Swap instruction to make L point to q and acquires the lock (Figure 5.39).

Clearly, $Q \Leftrightarrow \mathcal{L}$ after the acquire_lock decisive instruction.

Case 2: $Q = (q_0, q_1, q_2, \ldots, q_n)$ before the acquire_lock decisive instruction.
The state of the queue Q after the acquire_lock is given by

$enqueue((q_0, q_1, q_2, \ldots, q_n), q) \rightarrow (q_0, q_1, q_2, \ldots, q_i, q, q_{i+1}, \ldots, q_n)$

The corresponding $\mathcal{L}$ before and after the acquire_lock is shown in Figure 5.40.
The pointers P and N are the Prev_Node and Next_Node pointers by which q's
acquire_lock operation positions its record such that the process observes
$Pr(q_i) \geq Pr(q) > Pr(q_{i+1})$. Then, the Next pointer in q is set to the address of
$q_{i+1}$. The Compare&Swap instruction, marked F in Figure 5.34, attempts to make the
Next pointer in $q_i$ point to q. If the Compare&Swap instruction succeeds, then it
is the decisive instruction of q's process and the resulting queue $\mathcal{L}$ is
illustrated in Figure 5.40. This is equivalent to Q after the enqueue operation. If the
Compare&Swap succeeds only when $q_i$ is in the queue, $q_{i+1}$ is the successor record,
and $Pr(q_i) \geq Pr(q) > Pr(q_{i+1})$, then $Q \Leftrightarrow \mathcal{L}$.

If there are no concurrent operations on the queue, we can observe that P
and N are positioned correctly and the Compare&Swap succeeds. If there are other
concurrent operations, they can interfere with the execution of an acquire_lock
operation, A. There are three possibilities:

Case a: Another acquire_lock A' enqueued its record q' between $q_i$ and $q_{i+1}$, but
$q_i$ has not yet been dequeued. If $Pr(q_i) \geq Pr(q) > Pr(q_{i+1})$, then q's process
will attempt to insert q between $q_i$ and $q_{i+1}$. A' has modified the next pointer
of $q_i$, so q's Compare&Swap will fail. Since $q_i$ has not been dequeued,
$Pr(q_i) \geq Pr(q)$, and q's process should continue its search from $q_i$, which is
what happens. If $Pr(q_{i+1}) \geq Pr(q)$, then q's process can skip over q' and continue
searching from $q_{i+1}$, which is what happens. This scenario is illustrated in
Figure 5.41.

Case b: A release_lock operation R overtakes A and removes $q_i$ from the queue
(i.e., R has set the Dq bit of $q_i$), and $q_i$ has not yet been returned to the queue
(its Dq bit is still set). Since $q_i$ is not in the lock queue, A is lost and must start
searching again. Based on its observations of $q_i$ and $q_{i+1}$, A may have decided to
continue searching the queue or to commit its operation. In either case A sees the Dq
bit set and fails, so A starts again from the beginning of the queue. This scenario is
illustrated in Figure 5.42.

Figure 5.41. A concurrent acquire_lock A' succeeds before A


Case c: A release_lock operation R overtakes A and removes $q_i$ from the queue,
and then $q_i$ is put back in the queue by another acquire_lock A'. If A tries to commit
its operation, then the pointer in $q_i$ has changed, so the Compare&Swap fails. Note
that even if $q_i$ is pointing to $q_{i+1}$, the version numbers prevent the decisive
instruction from succeeding. If A continues searching, then there are two possibilities
based on the new value of $Pr(q_i)$. If $Pr(q) > Pr(q_i)$, A is lost and cannot find the
correct place to insert q. This condition is detected when the priority of $q_i$ is
examined (the lines marked iii in Figure 5.34), and operation A restarts from the head
of the queue. If $Pr(q) \leq Pr(q_i)$, then A can still find a correct place to insert q
past $q_i$, and A continues searching. This scenario is illustrated in Figure 5.43.

No matter what interference occurs, A always takes the right action. Therefore,
$Q \Leftrightarrow \mathcal{L}$ after the acquire_lock decisive instruction. □
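
The abstract enqueue used in Lemma 3 can be sketched in a few lines of Python. This is an illustration of the ordering rule only, under stated assumptions: records are hypothetical (name, priority) pairs, a larger number means a higher priority, and the head is never displaced because it holds the lock (the listings achieve this by giving the holder MAX_PRIORITY).

```python
from typing import List, Tuple

Record = Tuple[str, int]  # (name, priority); a larger number means higher priority


def enqueue(queue: List[Record], rec: Record) -> List[Record]:
    """Insert rec after the last q_i with Pr(q_i) >= Pr(rec); the head stays put."""
    if not queue:
        return [rec]            # Case 1: empty queue, rec becomes the head
    pos = 1                     # never insert in front of the lock holder
    while pos < len(queue) and queue[pos][1] >= rec[1]:
        pos += 1                # keep searching while Pr(q_{i+1}) >= Pr(rec)
    return queue[:pos] + [rec] + queue[pos:]


if __name__ == "__main__":
    q = [("q0", 9), ("q1", 5), ("q2", 2)]
    print(enqueue(q, ("q", 4)))   # q lands between q1 and q2
    print(enqueue([], ("q", 4)))  # q becomes the head
```

Records of equal priority are kept in arrival order, which is what the non-strict comparison against the predecessor gives.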

To prove the theorem we use induction. Initially, $Q = ()$ and L points to nil.
So, $Q \Leftrightarrow \mathcal{L}$ is trivially true. Suppose that the theorem is true
before the ith decisive instruction. If the ith decisive instruction is for an
acquire_lock operation, Lemma 3 $\Rightarrow$ $Q \Leftrightarrow \mathcal{L}$ after the
ith decisive instruction. If the ith decisive instruction is for a release_lock
operation, Lemma 1 $\Rightarrow$ $Q \Leftrightarrow \mathcal{L}$ after the ith decisive
instruction. If it is for a kill_lock operation, the same conclusion follows from the
kill_lock lemma proved above. Therefore, the inductive step holds, and hence,
$Q \Leftrightarrow \mathcal{L}$. □

Figure 5.42. A concurrent release_lock R succeeds before A

5.4.3    Experimental  Performance  Results 

We compared the performance of the MCS-Lock and the PR-Lock algorithms
with the ICSM-P algorithm. We selected two versions of the ICSM-P algorithm, one
in which a single process has the kill ability (ICSM-P1K), and another in which
two processes have the kill ability (ICSM-P2K). We again used the Motorola 68030
shared-memory multiprocessor system running the pSOS+ kernel.

We ran the experiments with four processors, each running a task that executes a
critical section protected by the lock. The tasks are identified by the number of the
processor on which they run, from 1 to 4. The priorities are assigned such that
higher task numbers have higher priorities.

Figure 5.43. release_lock R and acquire_lock A' succeed before A

Each task works (idles) for a uniformly distributed random time between 20
and 40 ms (milliseconds) before entering a critical section. The critical section time
is also a random variable, uniformly distributed within 25% of a selected mean. We
varied the critical section time used by each processor from 2.6 to 13 ms to achieve
lock utilizations of 25%, 50%, 75% and 100%. We stopped the experiment as soon
as any task completed 10,000 acquire/release lock operations and collected the time
to execute the critical section. We created histograms that show how frequently
the critical section execution takes a particular amount of time.
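
The histogram step can be illustrated with a short Python sketch. It is not the measurement code used in the experiments: the function names are hypothetical, and the 10 ms bin width with each bin labelled by its upper edge is an assumption read off the column headers of Tables 5.3 and 5.4.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def response_histogram(times_ms: Iterable[float], bin_ms: int = 10) -> Dict[int, float]:
    """Map each bin's upper edge (ms) to the percentage of samples in that bin."""
    samples = list(times_ms)
    # A sample lands in the bin whose upper edge is the next multiple of bin_ms.
    counts = Counter(max(1, math.ceil(t / bin_ms)) * bin_ms for t in samples)
    return {edge: 100.0 * n / len(samples) for edge, n in sorted(counts.items())}


def worst_case_bin(hist: Dict[int, float]) -> int:
    """Largest occupied bin, i.e. the worst case response time at bin resolution."""
    return max(hist)


if __name__ == "__main__":
    hist = response_histogram([4.0, 9.5, 12.0, 25.0])
    print(hist, worst_case_bin(hist))
```

Reading a table row this way, the worst case response time of a task is simply the right-most bin with a non-zero entry.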

The  performance  results  for  the  lock  utilization  25%,  50%,  and  75%  are  presented 
in  Table  5.3  and  the  results  for  100%  lock  utilization  are  presented  in  Table  5.4. 

First, let us consider the results for a lock utilization of 100% only. Using the
MCS-Lock algorithm, 0.75% of the critical section executions of the highest priority
task (task 4) had the worst case response time of 30 ms. The PR-Lock algorithm
reduced this worst case response time to 20 ms for task 4. This improvement comes at
the cost of increasing the frequency of the critical section executions that took 30 ms
for task 3. There is also a cost of increasing the worst case response time from 30 ms
to 40 ms for task 1 and task 2.

For task 4, the ICSM-P1K algorithm reduced the frequency of critical section
executions that took 20 ms from 33.30% to 0.70%. This improvement is achieved at the
penalty of a 10 ms increase in the worst case response time for tasks 2 and 3, and a
20 ms increase for task 1. The ICSM-P2K algorithm reduces the worst case response time
for task 3 from 40 ms to 30 ms. In this case, the worst case response time for task 1
becomes 100 ms and the worst case response time for task 2 becomes 60 ms.

There is a similar, but less significant, improvement for the other lock utilizations.
One reason for the smaller improvement at low lock utilization is the inability
to measure time accurately in the pSOS+ operating system. The other reason is that
the overhead of the ICSM-P algorithm becomes negligible as the lock utilization
increases.

As shown, the ICSM-P1K algorithm reduces the response time of the highest priority
task 4, and the ICSM-P2K algorithm reduces the response times of the high priority
tasks 3 and 4. The improvement in response time is more pronounced for higher lock
utilizations. As expected, there is an increase in the response time of the low
priority tasks when using the ICSM-P algorithm.

Table 5.3. Performance comparison of ICSM-P algorithm for lock utilization of 25%,
50%, and 75%

Lock utilization by column group; bins in ms; entries in % of critical section executions
          |      25%      |         50%          |                75%
Algo Task |     10     20 |     10     20     30 |     10     20     30     40     50
----------+---------------+----------------------+------------------------------------
MCS     1 |  97.46   2.54 |  91.46   8.54   0.07 |  81.89  18.02   0.09
        2 |  97.51   2.49 |  91.19   8.81        |  82.61  17.31   0.08
        3 |  97.78   2.22 |  91.65   8.35        |  83.01  16.92   0.07
        4 |  97.92   2.08 |  91.96   8.14        |  81.91  18.05   0.04
PR      1 |  97.50   2.50 |  89.13  10.80        |  80.12  16.56   0.92
        2 |  97.02   2.98 |  90.88   9.12        |  83.00  16.56   0.44
        3 |  97.57   2.43 |  92.19   7.81        |  83.80  16.02   0.18
        4 |  98.02   1.98 |  92.24   7.76        |  84.23  15.77
P1K     1 |  95.05   4.95 |  82.06  17.33   0.61 |  65.97  27.39   5.98   0.66
        2 |  95.08   4.92 |  84.25  15.75        |  69.98  26.03   3.78   0.21
        3 |  95.62   4.38 |  86.15  13.85        |  72.96  24.10   2.94
        4 |  99.63   0.37 |  99.63   0.37        |  99.56   0.44
P2K     1 |  92.64   7.36 |  75.68  22.41   1.91 |  53.00  28.82  11.88   4.87   1.43
        2 |  93.24   6.76 |  77.00  22.22   0.78 |  59.46  29.00   8.77   2.77
        3 |  96.77   3.23 |  90.90   9.10        |  87.58  11.68   0.74
        4 |  99.59   0.41 |  99.69   0.31        |  99.48   0.52

Table 5.4. Performance comparison of ICSM-P algorithm for lock utilization of 100%

Lock utilization 100%; bins in ms; entries in % of critical section executions
Algo Task |     10     20     30     40     50     60     70     80     90    100
----------+-----------------------------------------------------------------------
MCS     1 |  63.39  35.74   0.87
        2 |  63.84  35.31   0.85
        3 |  64.22  35.22   0.56
        4 |  64.38  34.87   0.75
PR      1 |  61.88  35.19   2.57   0.36
        2 |  63.49  34.84   1.51   0.16
        3 |  65.41  33.89   0.70
        4 |  66.70  33.30
P1K     1 |  31.34  35.89  16.82  11.08   4.63   4.02
        2 |  38.00  37.92  15.27   7.19   1.62
        3 |  46.78  35.89  12.51   4.92
        4 |  99.29   0.71
P2K     1 |  17.08  18.79   9.04  10.31  10.16   6.12   3.63   3.71   4.19  16.97
        2 |  31.43  28.18  10.76  12.10  10.93   6.60
        3 |  84.54  11.18   4.28
        4 |  99.52   0.48

5.4.4 ICSM-P algorithm with single word CAS

We present a more aggressive ICSM-P algorithm that uses a single word
Compare&Swap instead of the two-word Compare&Swap presented before.

In the acquire_lock operation, the higher priority task $p_h$ tries to set the Dq bit
of the lower priority task $p_c$ currently using the lock to TRUE, instead of first
enqueuing itself in the queue. This operation requires only a single word Compare&Swap
instruction. If the instruction is successful, the record of $p_h$ is modified to point
to the rest of the queued records. The higher priority task then changes the lock to
point to its own record, and enters the critical section. The Compare&Swap instruction
can fail if task $p_c$ sets the Dq bit instead, in preparation for a commit. The
Compare&Swap instruction of task $p_h$ also fails if another task with a higher priority
than $p_h$ succeeds in setting the Dq bit first. In either case, task $p_h$ then goes on
to queue its record as in the PR-Lock algorithm. The modified algorithm is presented
in Figure 5.44. The modifications are marked with the letter M.

The commit_release_lock operation is as before, except that a task $p_c$ that is
unsuccessful in releasing the lock waits until the lock pointer L is changed. This is
shown in Figure 5.45.
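
The race on the Dq bit between the killing task and the committing task can be simulated in Python. The sketch below is an illustration under stated assumptions, not the algorithm of Figures 5.44 and 5.45: `Record`, `Lock`, `cas_next`, `set_dq`, `try_kill` and `try_commit` are hypothetical names, a Python mutex stands in for the hardware atomicity of Compare&Swap, records carry their task priority, and the busy wait of a killed task is left out.

```python
import threading
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(eq=False)
class Record:
    name: str
    priority: int
    next_ptr: Optional["Record"] = None
    dq: bool = False  # next_ptr and dq together model the single Next word


class Lock:
    def __init__(self, holder: Optional[Record]) -> None:
        self.head = holder  # the lock pointer L


_atomic = threading.Lock()  # stands in for the atomicity of Compare&Swap


def cas_next(rec: Record, expected: Tuple[Optional[Record], bool],
             new: Tuple[Optional[Record], bool]) -> bool:
    """Single word Compare&Swap on the (next_ptr, dq) word of rec."""
    with _atomic:
        if (rec.next_ptr, rec.dq) == expected:
            rec.next_ptr, rec.dq = new
            return True
        return False


def set_dq(rec: Record) -> bool:
    """Flip Dq from FALSE to TRUE; only one contender can ever succeed."""
    while True:
        old = (rec.next_ptr, rec.dq)
        if old[1]:
            return False
        if cas_next(rec, old, (old[0], True)):
            return True


def try_kill(lock: Lock, challenger: Record) -> bool:
    """Higher priority task removes the current holder and takes the lock."""
    holder = lock.head
    if holder is None or holder.priority >= challenger.priority:
        return False
    if not set_dq(holder):
        return False  # the holder (or another killer) set the bit first
    challenger.next_ptr, challenger.dq = holder.next_ptr, False
    lock.head = challenger
    return True


def try_commit(lock: Lock, holder: Record) -> bool:
    """Holder commits its critical section and passes the lock on."""
    if not set_dq(holder):
        return False  # killed; the real task would wait until L changes
    lock.head = holder.next_ptr
    return True


if __name__ == "__main__":
    pc, ph = Record("pc", 1), Record("ph", 4)
    lock = Lock(pc)
    print(try_kill(lock, ph), try_commit(lock, pc), lock.head.name)
```

Whichever task sets the Dq bit first wins: after a successful kill the holder's commit fails, and after a successful commit a late kill attempt on the same record fails.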

We  expect  that  the  performance  of  the  modified  algorithm  will  be  similar  to  the 
ICSM-P  algorithm  discussed  before. 

5.5  Conclusions 

In this chapter, we extended Interruptible Critical Sections (ICS) to multiprocessor
systems. We showed how ICS can be implemented under different environments of
a multiprocessor system.

In the ICSM-R algorithm for a non-real-time multiprocessor system, tasks release
the lock during a context switch. The lock is re-acquired by the task during its
next quantum, and the critical section is restarted, if necessary. Our experimental
and analytical results show that, except for a high critical section time to quantum
time ratio, the ICSM-R algorithm performs better than an algorithm that holds the lock
while the task experiences a context switch. To implement the ICSM-R
algorithm, a small addition to the context-switch routine of the operating system is
necessary.

      Procedure acquire_lock(L, Self) {
        Success=FALSE;
        do {
          Prev_Node=NULL; Next_Node=NULL;
          if(CAS(&L, &Next_Node, &Self)) { /* No Lock */
            Success=TRUE; Failure=FALSE; /* Use Lock */
            Self.Ptr->Priority = MAX_PRIORITY; Self.Ptr->Next.Dq=FALSE;
          }
          else
M         if (Next_Node.Ptr->Priority < Self.Ptr->Priority)
M           while(Next_Node.Dq == FALSE) {
M             Temp_Node = Next_Node; Temp_Node.Dq = TRUE;
M             if(CAS(&Next_Node.Ptr->Next, &Next_Node, &Temp_Node)) {
M               Self.Ptr->Next.Ptr = Next_Node.Ptr;
M               Success=TRUE; Failure=FALSE; /* Use Lock */
M               Self.Ptr->Priority = MAX_PRIORITY;
M               Self.Ptr->Next.Dq=FALSE; L.Ptr = Self.Ptr;
M             }
M           }
          else { /* Lock in Use */
            Failure=FALSE; Self.Ptr->Locked=TRUE;
            do {
              Prev_Node=Next_Node; Next_Node=Prev_Node.Ptr->Next;
ii            if ((Next_Node.Dq==TRUE) /* Deque, Try Again */
iii               or (Prev_Node.Ptr->Priority<Self.Ptr->Priority))
                Failure=TRUE;
              else {
                if (Next_Node.Ptr==NULL or (Next_Node.Ptr!=NULL and
                    Next_Node.Ptr->Priority<Self.Ptr->Priority)) {
                  Self.Ptr->Next.Ptr=Next_Node.Ptr;
F                 if(CAS(&(Prev_Node.Ptr->Next), &Next_Node, &Self)) {
                    Self.Ptr->Next.Dq=FALSE;
                    while(Self.Ptr->Locked); /* Busy Wait */
                    Success=TRUE; /* Then, use lock */
                  }
                  else {
ii                  if ((Next_Node.Dq==TRUE) /* Deque, Try Again */
iii                     or (Prev_Node.Ptr->Priority < Self.Ptr->Priority))
                      Failure=TRUE;
                    else
                      Next_Node=Prev_Node;
                  }
                }
              }
            } while(!Success and !Failure);
          }
        } while(!Success);
      }

Figure 5.44. The acquire_lock ICSM-P with single word CAS




      Procedure commit_release_lock(L, Self, Commit_Parameters) {
        Status = FAILURE;
        Old_Value = Self.Ptr->Next;
        while (Self.Ptr->Next.Dq == FALSE)
        {
          New_Value = Old_Value; New_Value.Dq = TRUE;
          if(CAS(&Self.Ptr->Next, &Old_Value, &New_Value))
          {
            Commit the critical section operation;
            L = Self.Ptr->Next;
            if (Self.Ptr->Next != NULL) {
              L.Ptr->Priority=MAX_PRIORITY;
              L.Ptr->Locked=FALSE;
            }
            Status = SUCCESS;
          }
        }
        if(Status == FAILURE)
          while (L == Self.Ptr); /* Wait until L is changed */
        return(Status);
      }

Figure 5.45. The commit_release_lock ICSM-P with single word CAS

The ICSM-K algorithm prevents a possible priority inversion for a high priority task
that shares a critical section with a set of low priority tasks. The resulting increase in
the response time of the critical section for the low priority tasks depends on their lock
utilization. This algorithm is suitable for real-time multiprocessor systems. The
ICSM-K algorithm uses an atomic Compare&Swap instruction that is available on most
shared-memory multiprocessor systems.

In the ICSM-P algorithm, we consider a real-time shared-memory multiprocessor
system and a set of tasks with varying priorities. The ICSM-P algorithm reduces
the priority inversion that the high priority tasks can experience. The penalty is an
increase in the response times of the critical section for the low priority tasks. The
ICSM-P algorithm can be tuned to limit this increase to a certain degree. The ICSM-P
algorithm also uses an atomic Compare&Swap instruction for its implementation.

As in the uniprocessor system, using ICS in a multiprocessor system allows the
high priority task to always complete quickly. In a real-time system, ICS prevents
priority inversion. In addition, the ICS mechanism is independent of the scheduling
algorithm.


CHAPTER  6 
CONCLUSIONS 


Real-time systems are becoming increasingly important. It is becoming necessary
to  impose  deadlines  for  responses  to  every  action  (for  example,  message  transfers 
and  process  executions),  be  it  in  the  field  of  computer  networks,  database  systems, 
simulation,  AI,  uniprocessor  systems,  multiprocessor  systems  or  distributed  systems. 
Even  if  not  necessary,  we  can  find  applications  that  can  use  systems  that  respond 
within  deadlines. 

Synchronization  for  mutual  exclusion  is  an  important  issue  in  real-time  systems 
for  the  same  reasons  as  for  non-real  time  systems,  in  which  processes  cooperate 
for  using  a  shared  resource  in  an  orderly  way.  We  have  shown  in  Chapter  2  that 
existing  synchronization  mechanisms  for  non-real  time  systems  are  inadequate  for 
this  purpose. 

We have broadly classified synchronization algorithms depending on the
characteristics of the primitives: pessimistic and optimistic. In pessimistic
synchronization, a lock or a queue regulates the access to a shared resource so that
only one process uses the resource at a time. This is the traditional method, but it is
overly restrictive for a high priority process in a real-time system. In optimistic
synchronization, a process proceeds to use the shared resource on the assumption that
no other process is interfering with it. If such interference does occur, it is detected
and a recovery action is taken. The recovery action can be either to re-execute the
critical section that accesses the resource or to post the process's modification to the
resource so that other processes can complete the operation. We developed synchronization
algorithms using each of these techniques that perform well in real-time systems.
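
The optimistic pattern described above (proceed, detect interference, re-execute) can be shown with a generic Python sketch. It is not the Interruptible Critical Sections mechanism itself: `VersionedCell`, `optimistic_update` and the `before_commit` hook are hypothetical names, and a version counter is assumed as the conflict detector.

```python
import threading
from typing import Callable, Optional, Tuple


class VersionedCell:
    """Shared resource whose version number reveals interfering updates."""

    def __init__(self, value: int) -> None:
        self._value = value
        self._version = 0
        self._atomic = threading.Lock()  # makes read and commit individually atomic

    def read(self) -> Tuple[int, int]:
        with self._atomic:
            return self._version, self._value

    def try_commit(self, seen_version: int, new_value: int) -> bool:
        with self._atomic:
            if self._version != seen_version:
                return False  # conflict detected: someone committed in between
            self._value = new_value
            self._version += 1
            return True


def optimistic_update(cell: VersionedCell,
                      critical_section: Callable[[int], int],
                      before_commit: Optional[Callable[[int], None]] = None) -> int:
    """Run critical_section optimistically; re-execute on conflict. Returns attempts."""
    attempts = 0
    while True:
        attempts += 1
        seen_version, snapshot = cell.read()
        result = critical_section(snapshot)
        if before_commit is not None:
            before_commit(attempts)  # demo hook: lets a caller inject interference
        if cell.try_commit(seen_version, result):
            return attempts


if __name__ == "__main__":
    cell = VersionedCell(10)
    print(optimistic_update(cell, lambda v: v + 1), cell.read())
```

The recovery action chosen here is re-execution of the critical section; the low priority work is repeated only when an interfering commit is actually detected.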

6.1 PR-Lock Algorithm

The PR-Lock algorithm presented in Chapter 3 is based on the pessimistic
approach. PR-Lock is a prioritized spin-lock synchronization algorithm for
shared-memory multiprocessor systems. Processes using the PR-Lock wait in a
priority-ordered queue. We have shown that the amount of time to acquire the lock using
PR-Lock is inversely proportional to the priority of a process, thus making PR-Lock
suitable for real-time systems. Releasing the lock is a constant time operation.
Priority inversion, in which a higher priority process is blocked by a lower priority
process, can still occur using PR-Lock. Simulation results show that the PR-Lock
algorithm performs well in practice.

The PR-Lock algorithm has the following advantages over other proposed
prioritized spin-locks.

•  The  algorithm  is  contention  free. 

•  A  higher  priority  process  does  not  have  to  work  for  a  lower  priority  process 
while  releasing  the  lock.  As  a  result,  the  time  to  acquire  and  release  the  lock  is 
fast  and  predictable. 

•  The  PR-Lock  has  a  well  defined  acquire-lock  point. 

•  The PR-Lock maintains a pointer to the process using the lock that facilitates
implementing priority inheritance protocols.

6.2 Interruptible Critical Sections on Uniprocessors

In Chapter 4, we presented Interruptible Critical Sections (ICS), an optimistic approach
to synchronization that is applicable to real-time systems. We designed two
synchronization algorithms for a real-time uniprocessor system. The first algorithm is
based on ICS. The other algorithm, the Interruptible Lock, uses a combination of
Interruptible Critical Sections and semaphore locks. Both algorithms reduce the variance
in the response time of the highest priority task with only a small impact on the
performance of low priority tasks. The Interruptible Lock algorithm further improves the
performance of low priority tasks over the algorithm based on pure ICS.

We  further  show  how  Interruptible  Critical  Sections  can  be  combined  with  the 
Priority  Ceiling  Protocol,  and  present  an  analysis  which  shows  that  interruptible 
locks  improve  the  schedulability  of  task  sets  that  have  high  priority  tasks  with  tight 
deadlines. 

6.3    Interruptible  Critical  Sections  on  Multiprocessors 

We extended the usage of Interruptible Critical Sections to shared-memory
multiprocessors under various environments in Chapter 5.

The first environment considered is a non-real-time multiprocessor system. In
the ICSM-R algorithm, the lock is released if a process experiences a context switch.
When the process regains control of the processor, the lock is re-acquired, and if there
is no other conflicting execution of the critical section, the process continues.
Otherwise, the process restarts the critical section. Our experimental results show that
ICSM-R performs well under various load conditions compared to an algorithm that does
not release the lock. We also present an analysis showing that it is advantageous to use
ICSM-R, except for a high critical section time to quantum time ratio.

In the real-time environment, we first consider a single high priority task with a
set of low priority tasks. In the ICSM-K algorithm, the high priority task is allowed
to access the critical section unconditionally. Any other concurrent low priority task
executing the critical section detects this conflict and re-executes the critical section.
A semaphore lock is used to regulate the access to the critical section among the low
priority tasks.

In the ICSM-P algorithm, tasks can have arbitrary priorities. The ICSM-P algorithm
is based on the spin-lock PR-Lock algorithm and uses a priority queue. If the highest
priority process waiting for a lock in the queue finds that a lower priority process is
using the lock, it enters the critical section itself. The low priority process detects
this conflict and retries the lock acquisition.

Experimental results show that all the algorithms perform well, making the
Interruptible Critical Sections mechanism a viable approach to synchronization in
real-time systems.

6.4    Final  Words 

While the idea of synchronizing optimistically is not new, it is not being used
in real-time system environments. Optimistic concurrency control techniques were
proposed for database systems long ago, but recovery in a database system is a costly
operation.

Contrary to expectations, optimistic synchronization is suitable for real-time
systems. Critical high priority tasks have faster access to critical sections and the
overall schedulability of a set of processes can be improved. Lower priority processes
do not suffer many re-executions, as a real-time system is scheduled conservatively, not
aggressively.